Chapter 9
Introduction and the JetNet Dataset
“What I cannot create, I do not understand.” - Richard Feynman
Simulations are critical components of the scientific process in high-energy physics. They are employed from the beginning of the experimental design, to evaluate the expected performance of a detector, all the way up to the data analysis, to determine our sensitivity to a given signal process.
For the CMS experiment, the simulation pipeline broadly involves:
1. Event generation: simulating the hard collision process, parton showering, hadronization, and underlying event interactions (see Chapter 4.1), outputting generator-level or “gen-level” particles.
2. Detector simulation: simulating the response of the detector to the particles produced in the collision and outputting the raw detector signals, or “hits” in the detector, most commonly with the Geant4 software [310].
3. Reconstruction: converting the raw detector signals into tracks and ECAL/HCAL clusters, which can then be reconstructed further into physical “objects” like jets, leptons, and missing transverse energy (see Chapter 6.4).
The first two steps are inherently stochastic processes due to the randomness of quantum mechanical decays and interactions between particles and materials. This means that the complete analytic form for the probability densities of collision and detector outputs is intractable. Instead, traditionally, the event generation relies on Monte Carlo (MC) methods to sample from probability distributions of decays and interactions, while the detector simulation propagates the resulting particles through the detector and magnetic field, simulating the random interactions and energy deposits at each step.
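The sampling idea can be illustrated with a deliberately simplified toy, which is not CMS software and is not how real generators or Geant4 work internally. It chains two random stages: a "generation" stage that draws a true energy from an assumed exponential spectrum, and a "detector" stage that smears it with an assumed Gaussian resolution. The function name, the spectrum, and all parameter values are hypothetical choices made for illustration only. The point is that an observable can be estimated from samples without ever writing down the density of the final output.

```python
import numpy as np


def simulate_toy_events(n_events, mean_energy=100.0, resolution=0.1, seed=0):
    """Toy two-stage Monte Carlo chain (illustrative assumptions throughout)."""
    rng = np.random.default_rng(seed)

    # Stage 1, "event generation": draw a true energy per event from an
    # assumed exponential spectrum with the given mean (arbitrary units).
    true_energy = rng.exponential(scale=mean_energy, size=n_events)

    # Stage 2, "detector simulation": smear each energy with Gaussian noise
    # whose width is an assumed fixed fraction of the true energy.
    measured_energy = rng.normal(loc=true_energy, scale=resolution * true_energy)

    return true_energy, measured_energy


true_energy, measured_energy = simulate_toy_events(100_000)

# Estimate an observable directly from the samples: the fraction of events
# whose measured energy exceeds a threshold.
frac_above = np.mean(measured_energy > 200.0)
print(f"Fraction of events above threshold: {frac_above:.3f}")
```

The cost of this approach scales with the number of events and with the number of random steps per event, which is why a detailed step-by-step detector simulation is so much more expensive than this toy.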
These methods have proven extremely effective at modeling collisions and the detector response for decades in HEP, but are computationally expensive: the full simulation of a single collision in CMS takes [311]. To maximize the physics potential of the upcoming High-Luminosity LHC (HL-LHC) era, the CMS experiment will need to reconstruct 300 billion real collision events, and simulate and reconstruct 2–3× more, a monumental task to which current methods cannot scale. Indeed, we are expected to fall 3–10× short of the CPU resources necessary to do this at the HL-LHC [312].
There are two important avenues of R&D which must be explored to address this. First, performing simulation and reconstruction on GPUs: even porting a conservative fraction can improve computing capacity by 20–26%. Second, wide adoption of a fast simulation and reconstruction alternative (FastSim): 50% of CMS analyses switching to this would mean a 10× speed-up in simulations, which are in total expected to require 40% of CPU resources [312]. However, each avenue carries a risk: simulations may not translate well to GPUs because of their sequential nature, and inadequate FastSim performance may lead to insufficient adoption by analyzers.
ML advancements in generative modeling can simultaneously improve the quality of fast simulations and naturally enable GPU-acceleration, addressing both risks. In this Part, we introduce such advancements using novel physics-informed deep learning (DL) generative models, and efficient and sensitive techniques for their validation.
Below, we first introduce the problem of simulating high-energy jets and the JetNet benchmark dataset used for all studies in the chapter. As highlighted in Chapter 7, a key contribution of this work is the use of particle-cloud representations of jets, which we argued are more natural for HEP data than the image- and vector-based approaches that were more common at the time.
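As a rough illustration of a particle-cloud representation, the sketch below stores a jet as an unordered set of particles, each with a few features, padded to a fixed size with a boolean mask. It uses only NumPy and does not use the JetNet library's API. The helper name `to_particle_cloud`, the three per-particle features (relative pseudorapidity, relative azimuthal angle, relative transverse momentum), and the cap of 30 particles are assumptions chosen for illustration.

```python
import numpy as np


def to_particle_cloud(particles, max_particles=30):
    """Pad or truncate a jet's particles to a fixed-size array plus a mask.

    `particles` is a sequence of (eta_rel, phi_rel, pt_rel) tuples; the
    feature choice and `max_particles` are illustrative assumptions.
    """
    arr = np.asarray(particles, dtype=float).reshape(-1, 3)

    # If the jet has too many particles, keep those with the highest pt_rel.
    order = np.argsort(-arr[:, 2])
    arr = arr[order][:max_particles]

    n_real = arr.shape[0]
    cloud = np.zeros((max_particles, 3))
    mask = np.zeros(max_particles, dtype=bool)
    cloud[:n_real] = arr
    mask[:n_real] = True  # True marks real particles, False marks padding
    return cloud, mask


# A hypothetical jet with three constituents.
jet = [(0.01, -0.02, 0.30), (-0.05, 0.04, 0.50), (0.10, 0.08, 0.20)]
cloud, mask = to_particle_cloud(jet)

# Jet-level quantities should not depend on the order of the particles.
rng = np.random.default_rng(0)
shuffled = cloud[rng.permutation(len(cloud))]
total_pt = cloud[mask, 2].sum()
total_pt_shuffled = shuffled[:, 2].sum()  # padding rows contribute zero
```

Unlike an image, this representation does not bin particles into a fixed grid, and unlike a flat vector it does not impose a meaningful ordering; the mask lets jets with different numbers of constituents share one array shape.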
To leverage point clouds in HEP, we introduce two novel and highly performant ML-based approaches, using (1) message-passing graph neural networks (GNNs) (Chapter 10.1) and (2) attention-based transformer networks (Chapter 10.2). We then discuss the critical problem of validating such ML-based fast simulation techniques and propose two new, sensitive methods to do so (Chapter 11). Finally, we conclude with the outlook for these techniques in Chapter 12, also discussing how this work has sparked a new, vibrant subfield of ML research in HEP.