15.1 JetNet
It is essential in scientific research to maintain standardized benchmark datasets following the findable, accessible, interoperable, and reusable (FAIR) data principles [404], practices for using the data, and methods for evaluating and comparing different algorithms. This can often be difficult in high energy physics (HEP) because of the broad set of formats in which data is released and the expert knowledge required to parse the relevant information. The JetNet Python package aims to facilitate this by providing a standard interface and format for HEP datasets, integrated with PyTorch [405], to improve accessibility for both HEP experts and new or interdisciplinary researchers looking to do ML. Furthermore, by providing standard formats and implementations for evaluation metrics, it makes results easier to reproduce and models easier to assess and benchmark. JetNet is complementary to existing efforts for improving HEP dataset accessibility, notably the EnergyFlow library [406], with a unique focus on ML applications and integration with PyTorch.
JetNet currently provides easy-to-access and standardized interfaces for the JetNet dataset (Chapter 9.2), as well as the top quark tagging [407, 408] and quark-gluon tagging [409] reference datasets, all hosted on Zenodo [410]. It also provides standard implementations of the generative evaluation metrics discussed in Chapter 11, including Fréchet physics distance (FPD), kernel physics distance (KPD), 1-Wasserstein distance (W1), Fréchet ParticleNet distance (FPND), coverage, and minimum matching distance (MMD). Finally, JetNet also implements custom loss functions, such as a differentiable version of the energy mover’s distance [325], and more general jet utilities.
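To make two of the metric families above concrete, the following is a minimal, standard-library-only sketch of a 1-Wasserstein distance and a Fréchet distance between Gaussian fits, both restricted to a single one-dimensional jet feature. It is not JetNet's implementation: the function names `wasserstein_1d` and `frechet_gaussian_1d` and the toy mass values are hypothetical, equal sample sizes are assumed for W1, and the Fréchet physics distance of Chapter 11 is computed on multi-dimensional features rather than on one scalar.

```python
import statistics


def wasserstein_1d(real, generated):
    """1-Wasserstein distance between two equal-size 1D samples.

    For equal-size empirical distributions in one dimension, the optimal
    transport plan pairs the sorted samples, so W1 is the mean absolute
    difference between the sorted values.
    """
    real_sorted = sorted(real)
    gen_sorted = sorted(generated)
    if len(real_sorted) != len(gen_sorted):
        raise ValueError("this sketch assumes equal sample sizes")
    total = sum(abs(r - g) for r, g in zip(real_sorted, gen_sorted))
    return total / len(real_sorted)


def frechet_gaussian_1d(real, generated):
    """Squared Fréchet distance between 1D Gaussians fitted to each sample.

    In one dimension the general expression reduces to
    (mu_r - mu_g)^2 + (sigma_r - sigma_g)^2.
    """
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sigma_r, sigma_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


# Toy "jet mass" values (made-up numbers, arbitrary units).
real_masses = [0.10, 0.12, 0.15, 0.20]
generated_masses = [0.11, 0.13, 0.16, 0.21]

w1 = wasserstein_1d(real_masses, generated_masses)
fd = frechet_gaussian_1d(real_masses, generated_masses)
```

In this toy case the generated sample is the real one shifted by a constant, so both scores reflect only that shift: W1 equals the size of the shift, and the Fréchet score equals its square because the fitted widths agree.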
JetNet has had a considerable impact in the field, demonstrated by the surge in ML and HEP research it has facilitated, including in the areas of generative adversarial networks [411], transformers [376, 380, 412], diffusion models [378, 379], and equivariant networks [68, 377], all accessing datasets, metrics, and more through the package. In particular, it has been the basis for virtually all research in the last two years on ML-based fast jet simulations [376, 378–380, 411, 412], allowing objective comparisons and benchmarking of different algorithms; indeed, a planned direction for future work is a JetNet community challenge collating all of these results. From an educational perspective, we have also found JetNet to be a valuable tool for involving new students quickly in ML research, both through its use in easily initiating ML projects and through contributions to the software itself.
In the future, we hope to expand the package with additional dataset loaders, including for detector-level data, and with support for different machine learning backends such as JAX [413]. Performance improvements, such as optional lazy loading of large datasets, are also planned, as well as community challenges to benchmark algorithms, as discussed above.
Acknowledgements
This chapter is, in part, a reprint of the materials as they appear in JOSS, 2023, R. Kansal; C. Pareja; Z. Hao; and J. Duarte; JetNet: A Python package for accessing open datasets and benchmarking machine learning methods in high energy physics. The dissertation author was the primary investigator and author of this paper.