11.1 Evaluation metrics for generative models
In evaluating generative models, we aim to quantify the difference between the real and generated data distributions, $p_{\mathrm{real}}$ and $p_{\mathrm{gen}}$ respectively, where data samples are typically high dimensional. Lacking tractable analytic distributions in general, this can be viewed as a form of two-sample goodness-of-fit (GOF) testing of the hypothesis $p_{\mathrm{real}} = p_{\mathrm{gen}}$ using real and generated samples drawn from their respective distributions. As illustrated in Ref. [337], in general there is no “best” GOF test with power against all alternative hypotheses. Instead, we aim for a set of tests that collectively have power against the relevant alternatives we expect and are practically most appropriate. Below, we first outline the criteria we require of our evaluation metrics, then review and discuss the suitability of possible metrics, and end with a discussion of the features to use in comparing such high-dimensional distributions, thereby motivating FPD and KPD.
Criteria for evaluation metrics in HEP
Typical failure modes in ML generative models such as normalizing flows and autoregressive models include a lack of sharpness and smearing of low-level features, while generative adversarial networks (GANs) often suffer from “mode collapse”, where they fail to capture the entire real distribution, only generating samples similar to a particular subset. Therefore, with regard to the performance of generative models, we require first and foremost that the tests be sensitive to both the quality and the diversity of the generated samples. It is critical that these tests are multivariate as well, particularly when measuring the performance of conditional models, which learn conditional distributions given input features such as those of the incoming particle into a calorimeter or originating parton of a jet, and which will be necessary for applications to LHC simulations [338]. Multivariate tests are required in order to capture the correlations between different features, including those on which such a model is conditioned. Finally, it is desirable for the test’s results to be interpretable to ensure trust in the simulations.
To facilitate a fair, objective comparison between generative models, we also require the tests to be reproducible—i.e., repeating the test on a fixed set of samples should produce the same result—and standardizable across different datasets, such that the same test can be used for multiple classes and data structures (e.g., both images and point clouds for calorimeter showers or jets). It is also desirable for the test to be reasonably efficient in terms of speed and computational resources, to minimize the burden on researchers evaluating their models.
Evaluation metrics
Having outlined criteria for our metrics, we now discuss possible metrics and their merits and limitations. The traditional method for evaluating simulations in HEP is to compare physical feature distributions using one-dimensional (1D) binned projections. This allows valuable, interpretable insight into the physics performance of these simulators. However, it is intractable to extend this binned approach to multiple feature distributions simultaneously, as it falls victim to the curse of dimensionality: the number of bins and samples required to retain a reasonable granularity in our estimate of the multidimensional distribution grows exponentially with the number of dimensions. Therefore, while valuable, this method is restricted to evaluating single features, losing sensitivity to correlations and conditional distributions.
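As a minimal, hypothetical sketch of this traditional approach (the feature, binning, and sample sizes below are our own illustrative choices, not from any specific analysis), one can compare a single 1D feature between real and generated samples with a binned chi-squared-style statistic:

```python
# Illustrative sketch: a binned comparison of one 1D feature (e.g., jet mass)
# between real and generated samples. Extending this to many features jointly
# requires exponentially more bins and samples.
import numpy as np

def binned_chi2(real, gen, n_bins=50):
    """Chi-squared-style statistic between two 1D samples, using common bin edges."""
    lo, hi = min(real.min(), gen.min()), max(real.max(), gen.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    h_real, _ = np.histogram(real, bins=edges)
    h_gen, _ = np.histogram(gen, bins=edges)
    var = h_real + h_gen          # Poisson variances of both histograms
    mask = var > 0                # skip empty bins
    return np.sum((h_real[mask] - h_gen[mask]) ** 2 / var[mask])

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 50_000)  # stand-in for a real feature distribution
gen = rng.normal(0.05, 1.1, 50_000)  # slightly mismodeled generated distribution
print(binned_chi2(real, gen))
```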
Integral probability metrics and $f$-divergences To extend to multivariate distributions, we first review measures of differences between probability distributions. The two prevalent, almost mutually exclusive,1 classes of discrepancy measures are integral probability metrics (IPMs) [340] and $f$-divergences. An IPM $d_{\mathcal{F}}(p, q)$, defined as
\[
d_{\mathcal{F}}(p, q) = \sup_{f \in \mathcal{F}} \bigl| \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \bigr|, \tag{11.1.1}
\]
measures the difference between the two distributions, $p$ and $q$ in Eq. (11.1.1), by using a “witness” function $f$, out of a class of measurable, real-valued functions $\mathcal{F}$, which maximizes the absolute difference in its expected value over the two distributions. The choice of $\mathcal{F}$ defines different types of IPMs. The famous Wasserstein 1-distance ($W_1$) [341, 342], for example, is an IPM for which $\mathcal{F}$ in Eq. (11.1.1) is the set of all $K$-Lipschitz functions (where $K$ is any positive constant). Maximum mean discrepancy (MMD) [343] is another popular example, where $\mathcal{F}$ is the unit ball in a reproducing kernel Hilbert space (RKHS).
$f$-divergences, on the other hand, are defined as
\[
D_f(p \,\|\, q) = \int f\!\left(\frac{p(x)}{q(x)}\right) q(x)\,\mathrm{d}x. \tag{11.1.2}
\]
They calculate the average of the pointwise differences between the two distributions, $p$ and $q$ in Eq. (11.1.2), transformed by a “generating function” $f$ and weighted by $q$. Like IPMs, different $f$-divergences are defined by the choice of generating function. Famous examples include the Kullback-Leibler (KL) [344] and Jensen-Shannon (JS) [345, 346] divergences, which are widely used in information theory to capture the expected information loss when modeling $p$ by $q$ (or vice versa), as well as the Pearson $\chi^2$ divergence [347] and related metrics [348–350], which are ubiquitous in HEP as GOF tests.
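As a simple illustration of how such divergences are estimated in practice (our own toy setup, not from the text), the sketch below estimates the KL and JS divergences between two 1D samples from binned density estimates; the binning step is precisely what becomes intractable as the dimensionality grows:

```python
# Illustrative sketch: estimating KL and JS divergences between two 1D samples
# from binned (histogram) density estimates. Feasible in 1D, but the binning
# suffers from the curse of dimensionality in higher dimensions.
import numpy as np
from scipy.special import rel_entr
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
p_samples = rng.normal(0.0, 1.0, 100_000)
q_samples = rng.normal(0.2, 1.0, 100_000)

edges = np.linspace(-5, 5, 101)
p_hist, _ = np.histogram(p_samples, bins=edges)
q_hist, _ = np.histogram(q_samples, bins=edges)
p = p_hist / p_hist.sum()
q = q_hist / q_hist.sum()

eps = 1e-12                                  # regularize empty bins to keep KL finite
kl_pq = np.sum(rel_entr(p + eps, q + eps))   # KL(p || q)
js_pq = jensenshannon(p, q, base=2) ** 2     # JS divergence (squared JS distance)
print(kl_pq, js_pq)
```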
Overall, $f$-divergences can be powerful measures of discrepancies, with convenient information-theoretic interpretations and the advantage of coordinate invariance. However, unlike IPMs, they do not generally take into account the metric space of the distributions, which is why we argue that IPMs are more useful for evaluating generative models and their learned distributions. An illustrative example of this is provided in Appendix D.1. IPMs can thereby be powerful metrics with which to compare different models, with measures such as $W_1$ and MMD able to metrize the weak convergence of probability measures [342, 351].
Additionally, on the practical side, finite-sample estimation of $f$-divergences such as the KL and Pearson $\chi^2$ divergences is intractable in high dimensions, generally requiring a partitioning of the feature space, which suffers from the curse of dimensionality as described above. References [339, 352] demonstrate more rigorously the efficacy of finite-sample estimation of IPMs, in comparison to the difficulty of estimating $f$-divergences.
IPMs as evaluation metrics Having argued in their favor, we discuss specific IPMs and related measures, and their viability as evaluation metrics. The most famous is the Wasserstein 1-distance ($W_1$) [341, 342], as defined above. It is closely related to the problem of optimal transport [342]: finding the minimum “cost” to transport the mass of one distribution to another, when the cost associated with transport between two points is the Euclidean distance between them. This metric is sensitive to both the quality and diversity of generated distributions; however, its finite-sample estimator is the optimum of a linear program (an optimization problem with linear constraints and objective) [353], which, while tractable in 1D, is biased with very poor convergence in high dimensions [354]. We demonstrate these characteristics empirically in Sections 11.2 and 11.3.
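To make the contrast concrete, the sketch below computes finite-sample $W_1$ estimates: the exact 1D estimator via scipy, and a multi-dimensional estimate obtained by solving the optimal-transport linear program with the POT package (an assumed dependency; the sample sizes and dimensions are arbitrary illustrative choices):

```python
# Sketch of finite-sample Wasserstein-1 estimates. The 1D case has an exact,
# efficient estimator (scipy); the multi-dimensional case requires solving the
# optimal-transport linear program, which is where the bias and poor
# high-dimensional convergence appear.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real_1d = rng.normal(0.0, 1.0, 10_000)
gen_1d = rng.normal(0.1, 1.1, 10_000)
print("W1 (1D):", wasserstein_distance(real_1d, gen_1d))

# Multi-dimensional W1 between equal-weight empirical distributions,
# using the POT (Python Optimal Transport) package if available.
import ot

real_nd = rng.normal(0.0, 1.0, size=(500, 8))
gen_nd = rng.normal(0.1, 1.0, size=(500, 8))
M = ot.dist(real_nd, gen_nd, metric="euclidean")  # pairwise Euclidean cost matrix
a = b = np.full(500, 1 / 500)                     # uniform sample weights
print("W1 (8D):", ot.emd2(a, b, M))               # exact linear-program solution
```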
A related pseudometric2 is the Fréchet, or 2-Wasserstein ($W_2$), distance between Gaussian distributions fitted to the features of interest, which we generically call the Fréchet Gaussian distance (FGD). A form of this known as the Fréchet Inception distance (FID) [324], which uses the activations of the Inception v3 convolutional neural network [355] on samples of real and generated images as its features, is currently the standard metric for evaluation in computer vision. The FID has been shown to be sensitive to both quality and mode collapse in generative models and is extremely efficient to compute; however, it has the drawback of assuming Gaussian distributions for its features. While finite-sample estimates of the FGD are biased [356], Ref. [357] introduces an effectively unbiased estimator, obtained by extrapolating from multiple finite-sample estimates to the infinite-sample value.
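A minimal sketch of an FGD computation under these Gaussian assumptions (a generic numpy/scipy implementation for illustration, not the exact code used in this work) is:

```python
# Sketch of a Frechet Gaussian distance (FGD) between two feature sets:
# fit a multivariate Gaussian to each and compute
# ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
import numpy as np
from scipy.linalg import sqrtm

def fgd(feats_real, feats_gen):
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean)

rng = np.random.default_rng(0)
feats_real = rng.normal(0.0, 1.0, size=(10_000, 5))
feats_gen = rng.normal(0.05, 1.1, size=(10_000, 5))
print(fgd(feats_real, feats_gen))

# An effectively unbiased variant can be obtained by evaluating fgd() at several
# sample sizes and extrapolating (e.g., linearly in 1/N) to the infinite-sample limit.
```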
The final IPM we discuss is the MMD [358], for which $\mathcal{F}$ is the unit ball in an RKHS for a chosen kernel $k$. Intuitively, it is the distance between the mean embeddings of the two distributions in the RKHS, and it has been demonstrated to be a powerful two-sample test [343, 359]. However, high sensitivity generally requires tuning the kernel based on the two sets of samples. For example, the traditional choice is a radial basis function kernel, whose bandwidth is typically chosen based on the statistics of the two samples [343]. While such a kernel has the advantage of being characteristic—i.e., it produces an injective embedding [360]—to maintain a standard and reproducible metric, we experiment instead with fixed polynomial kernels of different orders. These kernels allow access to high-order moments of the distributions and have been proposed in computer vision as an alternative to the FID, termed the kernel Inception distance (KID) [356]. The MMD has unbiased estimators [343], which have been shown to converge quickly even in high dimensions [356].
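For concreteness, a sketch of the unbiased MMD$^2$ estimator with a fixed polynomial kernel, in the spirit of KID (the kernel degree and sample sizes here are illustrative choices, not a prescription), is:

```python
# Sketch of an unbiased MMD^2 estimate with a fixed polynomial kernel
# k(x, y) = (x . y / d + 1)^degree, as used for KID-style metrics.
import numpy as np

def poly_kernel(X, Y, degree=3):
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** degree

def mmd2_unbiased(X, Y, degree=3):
    m, n = len(X), len(Y)
    k_xx = poly_kernel(X, X, degree)
    k_yy = poly_kernel(Y, Y, degree)
    k_xy = poly_kernel(X, Y, degree)
    # Drop diagonal (self-similarity) terms for the unbiased estimator.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_xy = k_xy.mean()
    return term_xx + term_yy - 2 * term_xy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(2_000, 5))   # real features
Y = rng.normal(0.1, 1.0, size=(2_000, 5))   # generated features
print(mmd2_unbiased(X, Y))
```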
Manifold estimation Another class of evaluation metrics recently popularized in computer vision involves estimating the underlying manifolds of the real and generated samples. While computationally challenging, such metrics can be intuitive and allow us to disentangle the quality and diversity aspects of the generated samples, which can be valuable in diagnosing individual failure modes of generative models. The most popular metrics are “precision” and “recall” as defined in Ref. [361]. For these, each manifold is first estimated as the union of spheres centered on each sample, with radii equal to the distance to the sample’s $k$th-nearest neighbor. Precision is then defined as the fraction of generated points which lie within the real manifold, and recall as the fraction of real points within the generated manifold. Alternatives, named density and coverage, are proposed in Ref. [362] with a similar approach, but which use only the real manifold and take into account the density of the spheres rather than just their union. We study the efficacy of both pairs of metrics for our problem in Sections 11.2 and 11.3.
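A rough sketch of this $k$-NN manifold construction (a simplified, brute-force illustration under the above definitions, not an optimized implementation) is:

```python
# Rough sketch of precision/recall via k-NN manifold estimation: each sample's
# sphere radius is its distance to its k-th nearest neighbor within the same set,
# and a point is "covered" if it falls inside any sphere of the other set.
import numpy as np
from scipy.spatial import cKDTree

def knn_radii(X, k=3):
    # k + 1 because the nearest neighbor of a point in its own set is itself.
    dists, _ = cKDTree(X).query(X, k=k + 1)
    return dists[:, -1]

def manifold_fraction(points, manifold_pts, radii):
    """Fraction of `points` lying within at least one sphere of the manifold set."""
    tree = cKDTree(manifold_pts)
    covered = 0
    for p in points:
        candidates = tree.query_ball_point(p, r=radii.max())
        covered += any(np.linalg.norm(p - manifold_pts[i]) <= radii[i] for i in candidates)
    return covered / len(points)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1_000, 4))
gen = rng.normal(0.0, 1.0, size=(1_000, 4))
precision = manifold_fraction(gen, real, knn_radii(real))   # sensitive to quality
recall = manifold_fraction(real, gen, knn_radii(gen))       # sensitive to diversity
print(precision, recall)
```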
Classifier-based metrics Finally, an alternative class of GOF tests, proposed in Refs. [359, 363, 364] and, most relevantly for evaluating simulated calorimeter showers, in Ref. [365] and the fast calorimeter simulation challenge [366], is based on binary classifiers trained to distinguish real from generated data. These tests have been posited to be sensitive to both quality and diversity; however, they have significant practical and conceptual drawbacks for understanding and comparing generative models.
First, deep neural networks (DNNs) are widely considered uninterpretable black boxes [367]; hence, it is difficult to discern which features of the generated data the network identifies as discrepant or compatible. Second, the performance of DNNs is highly dependent on both the architecture and the dataset, and it is unclear how to specify a standard architecture sensitive to all possible discrepancies for all datasets. Furthermore, DNN training is typically slow and stochastic, minimizing a complex loss function with several potential local minima; hence, it is sensitive to initial states and hyperparameters irrelevant to the problem, difficult to reproduce, and inefficient.
In terms of GOF testing, evaluating the performance of an individual generative model requires a more careful understanding of the null distribution of the test statistic than is proposed in Refs. [365, 366], for example by using a permutation test as suggested in Refs. [359, 363], or by retraining the classifier numerous times on independent samples drawn from the true distribution, as proposed recently in Refs. [368, 369] with applications to HEP searches. However, even if such a test were performed for each model, which would itself be practically burdensome, it would remain difficult to compare models fairly: because a different classifier is trained for each model, one ends up comparing values of entirely different test statistics.3 Despite these drawbacks, we perform the classifier-based test from Refs. [365, 366] in Section 11.3 and find that, perhaps surprisingly, it is insensitive to a large class of failures typical of ML generative models.
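To illustrate the practical burden, the following hedged sketch implements a classifier-based two-sample test with a permutation-derived null, using a small scikit-learn classifier as a stand-in for the architectures of Refs. [365, 366] (sample sizes, model, and number of permutations are arbitrary toy choices); note that every permutation requires retraining the classifier from scratch:

```python
# Hypothetical sketch of a classifier-based two-sample test with a
# permutation-derived null distribution. Not the specific setup of
# Refs. [365, 366]; intended only to show the cost of estimating the null.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def classifier_auc(X, Y, seed=0):
    data = np.concatenate([X, Y])
    labels = np.concatenate([np.zeros(len(X)), np.ones(len(Y))])
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=seed)
    return cross_val_score(clf, data, labels, cv=3, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1_000, 5))
gen = rng.normal(0.1, 1.0, size=(1_000, 5))

observed = classifier_auc(real, gen)

# Null distribution: shuffle the real/generated labels and retrain each time.
pooled = np.concatenate([real, gen])
null = []
for i in range(20):  # far more permutations would be needed in practice
    perm = rng.permutation(len(pooled))
    null.append(classifier_auc(pooled[perm[:1_000]], pooled[perm[1_000:]], seed=i))

p_value = np.mean([n >= observed for n in null])
print(observed, p_value)
```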
Feature selection
We end by discussing which features to select for evaluation. Generally, for data such as calorimeter showers and jets, individual samples are extremely high dimensional, with showers and jets containing up to hundreds or thousands of hits and particles, respectively, each with its own set of features. Apart from the practical challenges of comparing distributions in such a high-dimensional space, this full set of low-level features is often not the most relevant for our downstream use case.
This is an issue in computer vision as well, where images are similarly high dimensional and directly comparing the low-level, high-dimensional feature space of pixels is neither practical nor meaningful. Instead, the current solution is to derive salient, high-level features from the penultimate layer of a pretrained state-of-the-art classifier.
This approach is necessary for images, for which it is difficult to define such meaningful numerical features by hand. We also tried a similar approach in Section 10.1 with the Fréchet ParticleNet distance (FPND), which uses the ParticleNet jet classifier to derive its features. However, one key insight of this work is that this may be unnecessary for HEP applications, as we have already developed a variety of meaningful, hand-engineered features such as jet observables [322, 326, 370] and shower-shape variables [371, 372]. Such variables may lead to a more efficient, more easily standardized, and more interpretable test. We experiment with both types of features in Section 11.3.
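As an illustration of such hand-engineered features (a toy example assuming per-particle $(p_\mathrm{T}, \eta, \phi)$ inputs, not the full set of observables used in Section 11.3), one can compute high-level jet observables directly from the particle cloud and feed them into FGD- or MMD-style metrics:

```python
# Illustrative sketch of hand-engineered high-level features: jet pT and mass
# computed from a per-particle (pt, eta, phi) point cloud, which could then be
# used as inputs to FGD- or MMD-style metrics instead of learned classifier features.
import numpy as np

def jet_features(particles):
    """particles: array of shape (n_particles, 3) with columns (pt, eta, phi)."""
    pt, eta, phi = particles.T
    px, py = pt * np.cos(phi), pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    e = pt * np.cosh(eta)  # massless-constituent approximation
    jet_px, jet_py, jet_pz, jet_e = px.sum(), py.sum(), pz.sum(), e.sum()
    jet_pt = np.hypot(jet_px, jet_py)
    jet_mass = np.sqrt(max(jet_e**2 - jet_px**2 - jet_py**2 - jet_pz**2, 0.0))
    return jet_pt, jet_mass

rng = np.random.default_rng(0)
jet = np.column_stack([
    rng.exponential(10, 30),            # toy particle pT spectrum
    rng.normal(0, 0.4, 30),             # toy eta spread
    rng.uniform(-np.pi, np.pi, 30),     # toy phi values
])
print(jet_features(jet))
```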
1The total variation distance is the only nontrivial discrepancy measure that is both an IPM and an $f$-divergence [339, Appendix A]; however, to our knowledge, a consistent finite-sample estimator for it does not exist (see, for example, Ref. [339, Section 5]).
2This is a pseudometric because distinct distributions can have a distance of 0 if they have the same means and covariances.
3In the case of Refs. [368, 369] the test statistic remains the same, but estimating the null distribution is even more practically challenging, as it involves multiple trainings of the classifier.