
8.3 Asymptotic formulae

8.3.1 Asymptotic form of the MLE

So far, we have discussed how to extract meaningful statistical results from HEP experiments by making extensive use of pseudodata / toy experiments to estimate the sampling distributions of profile-likelihood-ratio-based test statistics. While this worked nicely for our simple counting experiment, generating a sufficiently large number of toys can quickly become computationally intractable for the more complex searches (and statistical combinations of searches) that are increasingly prevalent at the LHC, containing at times up to thousands of bins and nuisance parameters. This and the following section discuss a way to approximate these sampling distributions without the need for pseudodata. This was introduced in the famous “CCGV” paper [301] in 2011 and has since become the de-facto procedure at the LHC.

As hinted at previously, such as in Figures 8.6 and 8.12, the distributions $p(\tilde{t}_\mu|\mu')$ and $p(\tilde{q}_\mu|\mu')$ (where, in general, $\mu' \neq \mu$) have similar forms regardless of the nuisance parameters (or sometimes even the POIs). This is not a coincidence: we will now derive their "asymptotic" (i.e., large-sample-limit) forms, starting first with the asymptotic form of the maximum likelihood estimator (MLE).

It is important to remember that the MLE μ̂ of μ is a random variable with its own probability distribution. We can estimate it as always by sampling toys, shown in Figure 8.20 for our counting experiment (Eq. 8.2.3). One can observe that $p(\hat{\mu})$ approaches a Gaussian distribution as the number of events N increases, and indeed this becomes clear if we try to fit one to the histograms (Figure 8.21). We will now show this to be true generally, first reviewing the necessary statistics background, then deriving the analytic distribution, and finally discussing the results and the important concept of the Asimov dataset for numerical estimation.
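To make the toy procedure concrete, the following is a minimal sketch (assuming numpy; the s, b values are placeholders) that generates toy datasets for the counting model of Eq. 8.2.3 and the corresponding μ̂ values, using the closed-form MLEs b̂ = m and μ̂ = (n − m)/s that reappear in Eqs. 8.3.27–8.3.28; histogramming them reproduces the shapes in Figures 8.20–8.21.

```python
import numpy as np

# Counting model of Eq. 8.2.3: n ~ Pois(mu*s + b), m ~ Pois(b).
# Closed-form MLEs (cf. Eqs. 8.3.27-8.3.28): b_hat = m, mu_hat = (n - m) / s.
def sample_mu_hat(s, b, mu_true=1.0, n_toys=30_000, seed=0):
    rng = np.random.default_rng(seed)
    n = rng.poisson(mu_true * s + b, size=n_toys)  # counts in the signal region
    m = rng.poisson(b, size=n_toys)                # counts in the control region
    return (n - m) / s                             # MLE of mu for each toy

# Placeholder s, b values: the spread of mu_hat narrows and the histogram looks
# increasingly Gaussian as s and b grow, as in Figures 8.20-8.21.
for s, b in [(5, 5), (50, 50), (500, 500)]:
    mu_hat = sample_mu_hat(s, b)
    print(f"s={s:4d} b={b:4d}  mean={mu_hat.mean():.3f}  std={mu_hat.std():.3f}")
```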


Figure 8.20. Distribution of the MLE of μ for different s and b produced using 30,000 toy experiments each. (Note the x-axis range is becoming narrower from the left-most to the right-most plot.)


Figure 8.21. Gaussian fits to distributions of μ^ for different s and b from Figure 8.20.
Statistics background

We first provide a lightning review of some necessary statistics concepts and results.

Definition 8.3.1. Let the negative log-likelihood (NLL) be $l(\mu) \equiv -\ln L(\mu)$. The derivative of the NLL, $l'(\mu)$, is called the score $s(\mu)$. It has a number of useful properties:5

1.
Its expectation value at the true value $\mu'$ vanishes: $\mathbb{E}_{\mu=\mu'}[s(\mu')] = 0$.
2.
Its variance is $\mathrm{Var}[s(\mu)] = \mathbb{E}[l''(\mu)]$.

Note that the expectation value here means an average over observations which are distributed according to a particular value of μ, which here we're calling the "true" value $\mu'$.

Definition 8.3.2. $\mathbb{E}[l''(\mu)] \equiv I(\mu)$ is called the Fisher information. It quantifies the information our data contains about μ and, importantly, as we'll see, it (approximately) represents the inverse of the variance of $\hat{\mu}$. More generally, for multiple parameters,

$$I_{ij}(\boldsymbol{\mu}) = \mathbb{E}\left[\frac{\partial^2 l}{\partial \mu_i\, \partial \mu_j}\right]$$
(8.3.1)

is the Fisher information matrix. Its inverse is commonly referred to as the covariance matrix.

Theorem 8.3.1. Putting this together, by the central limit theorem [306], $p(s(\mu'))$ follows a normal distribution with mean 0 and variance $I(\mu')$, up to terms of order $O(1/\sqrt{N})$:

$$s(\mu') \;\xrightarrow{N \gg 1}\; \mathcal{N}\big(0,\, I(\mu')\big),$$
(8.3.2)

where N represents the data sample size.

The Fisher information

For our simple counting experiment, the Fisher information matrix I(μ,b) can be found by taking second derivatives of the NLL (Eq. 8.2.5). The Iμμ term, for example, is:

$$I_{\mu\mu}(\mu, b) = \mathbb{E}\left[\partial_\mu \partial_\mu\, l(\mu, b)\right] = \mathbb{E}\left[\frac{n\, s^2}{(\mu s + b)^2}\right] = \mathbb{E}[n]\, \frac{s^2}{(\mu s + b)^2} = \frac{(\mu' s + b')\, s^2}{(\mu s + b)^2}.$$
(8.3.3)

In the last step we use the fact that $\mathbb{E}[n]$ under the true values $\mu = \mu'$, $b = b'$ is $\mu' s + b'$. For the remainder of this section, $I(\mu, b)$ will always be evaluated at the true values of the parameters,6 so this can be simplified to $I_{\mu\mu}(\mu', b') = \frac{s^2}{\mu' s + b'}$. This is plotted in Figure 8.22, where we can see that the Fisher information captures the fact that as b increases, we lose sensitivity to, or information about, μ.


Figure 8.22. The Fisher information Iμμ(μ,b) for different μ and s, as a function of the expected background b.

For completeness (and since we’ll need it below), the full Fisher information matrix for our problem, repeating the steps in Eq. 8.3.3, is:

$$I(\mu', b') = \begin{pmatrix} I_{\mu\mu} & I_{\mu b} \\ I_{b\mu} & I_{bb} \end{pmatrix}(\mu', b') = \begin{pmatrix} \dfrac{s^2}{\mu' s + b'} & \dfrac{s}{\mu' s + b'} \\[1.5ex] \dfrac{s}{\mu' s + b'} & \dfrac{1}{\mu' s + b'} + \dfrac{1}{b'} \end{pmatrix}$$
(8.3.4)
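As a quick cross-check of this matrix, here is a minimal numpy sketch (with placeholder values for s, b′, μ′) that builds it and inverts it; the resulting σμ̂ should match the closed form derived in Eq. 8.3.8 below.

```python
import numpy as np

def fisher_matrix(mu_p, b_p, s):
    """Fisher information matrix of Eq. 8.3.4, evaluated at the true values (mu', b')."""
    n_exp = mu_p * s + b_p
    return np.array([
        [s**2 / n_exp, s / n_exp],
        [s / n_exp,    1 / n_exp + 1 / b_p],
    ])

mu_p, b_p, s = 1.0, 50.0, 50.0               # placeholder "true" parameter values
I = fisher_matrix(mu_p, b_p, s)
cov = np.linalg.inv(I)                        # (approximate) covariance of (mu_hat, b_hat)
sigma_mu = np.sqrt(cov[0, 0])
print(sigma_mu, np.sqrt(mu_p * s + 2 * b_p) / s)   # the two should agree (Eq. 8.3.8)
```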

Derivation

We now have enough background to derive the asymptotic form of the MLE. We do this for the 1D case by Taylor-expanding the score at $\hat{\mu}$, $l'(\hat{\mu})$ (which we know to equal 0), around the true value $\mu'$:

\begin{align}
l'(\hat{\mu}) &= l'(\mu') + l''(\mu')(\hat{\mu} - \mu') + O\big((\hat{\mu} - \mu')^2\big) = 0 \tag{8.3.5} \\
\Rightarrow\; \hat{\mu} - \mu' &\approx -\frac{l'(\mu')}{l''(\mu')} \;\xrightarrow{N \gg 1}\; -\frac{1}{I(\mu')}\,\mathcal{N}\big(0,\, I(\mu')\big) = \mathcal{N}\left(0,\; \frac{1}{I(\mu')}\right), \tag{8.3.6}
\end{align}

where we plugged in the distribution of $l'(\mu')$ from Eq. 8.3.2, claimed that $l''(\mu')$ asymptotically equals its expectation value $\mathbb{E}[l''(\mu')] = I(\mu')$ by the law of large numbers [307], and are ignoring the $O((\hat{\mu} - \mu')^2)$ term.7

For multiple parameters, I is a matrix, so the variance generalizes to the corresponding element of the matrix inverse:

$$\hat{\mu} - \mu' \sim \mathcal{N}\left(0,\; I^{-1}_{\mu\mu}(\mu', b')\right).$$
(8.3.7)

Result

Thus, we see that $\hat{\mu}$ asymptotically follows a normal distribution around the true value $\mu'$, with a variance $\sigma_{\hat{\mu}}^2 = I^{-1}_{\mu\mu}(\mu', b')$, up to $O(1/\sqrt{N})$ terms. Intuitively, from the definition of the Fisher information I, we can interpret this as saying that the more information we have about μ from the data, the lower the variance should be on $\hat{\mu}$.

Continuing with our counting experiment from Section 8.2.1., inverting I from Eq. 8.3.4 gives us

$$\sigma_{\hat{\mu}} = \sqrt{I^{-1}_{\mu\mu}(\mu', b')} = \frac{\sqrt{\mu' s + 2 b'}}{s}.$$
(8.3.8)

Note that, as we might expect, this scales as $\sqrt{b'}$, which is the uncertainty of our Poisson nuisance parameter b, showing mathematically why we want to keep uncertainties on nuisance parameters as low as possible. This is compared to the toy-based distributions from Section 8.3.1 in Figure 8.23, this time varying the true signal strength $\mu'$ as well, where we can observe that the asymptotic form matches very well for large s, b, while for small values there are some visible differences.
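A small numerical check of this claim, comparing the spread of toy μ̂ values against $\sqrt{\mu' s + 2b'}/s$ for a few placeholder (s, b) values at μ′ = 1 (assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
for s, b in [(10, 10), (100, 100), (1000, 1000)]:   # placeholder values, mu' = 1
    # toy MLEs mu_hat = (n - m) / s for the counting model of Eq. 8.2.3
    mu_hat = (rng.poisson(s + b, 30_000) - rng.poisson(b, 30_000)) / s
    print(f"s=b={s:5d}  toy std = {mu_hat.std():.4f}   "
          f"asymptotic sigma = {np.sqrt(s + 2 * b) / s:.4f}")
```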


Figure 8.23. Asymptotic (dotted lines) and toy-based (solid lines) distributions, using 30,000 toys each, of the MLE of μ for different s, b, and true signal strengths $\mu'$.

We can also check the total per-bin errors between the asymptotic form and the toy-based distributions directly, as shown in Figure 8.24 (for $\mu' = 1$ only). Indeed, this confirms that the error scales as $1/\sqrt{s}$ and $1/\sqrt{b}$, as claimed above.


Figure 8.24. Error between the sampled toy distributions, using 50,000 toys each, and the asymptotic distributions of the MLE of μ for different s and b (blue), with $1/\sqrt{N}$ fits in red.
Numerical estimation and the Asimov dataset

In this section, because of the simplicity of our data model, we were able to derive the Fisher information I and, hence, the asymptotic form of μ^ analytically. In general, this is not possible and we typically have to minimize l, find its second derivatives, and solve Eq. 8.3.3 etc. numerically instead.

However, when calculating the Fisher information, how do we deal with the expectation value over the observed data (n,m in our case)? Naively, this would require averaging over a bunch of generated toy n,m values again, which defeats the purpose of using the asymptotic form of μ^!

Instead, we can switch the order of operations in Eq. 8.3.3,8 rewriting it as:

$$I_{ij}(\mu', b') = \mathbb{E}\left[\partial_i \partial_j\, l(\mu', b';\, n, m)\right] = \partial_i \partial_j\, \mathbb{E}\left[l(\mu', b';\, n, m)\right] = \partial_i \partial_j\, l\big(\mu', b';\, \mathbb{E}[n], \mathbb{E}[m]\big).$$
(8.3.9)

Importantly, this says we can find I by simply evaluating the likelihood for a dataset of observations set equal to their expectation values under $\mu', b'$ and then taking its second derivatives, instead of averaging over the distribution of observations.

Definition 8.3.3. Such a dataset is called the Asimov dataset, and $L(\mu;\, \mathbb{E}[n], \mathbb{E}[m]) \equiv L_A$ is referred to as the "Asimov likelihood".9
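In code, this amounts to evaluating (numerical) second derivatives of the NLL on the dataset $(n, m) = (\mathbb{E}[n], \mathbb{E}[m])$. Below is a minimal sketch for the counting model using central finite differences; the step size and parameter values are illustrative, and the result can be compared against the analytic matrix in Eq. 8.3.4.

```python
import numpy as np

def nll(theta, n, m, s):
    """NLL of the counting model (Eq. 8.2.5), up to constants; theta = (mu, b)."""
    mu, b = theta
    return (mu * s + b) - n * np.log(mu * s + b) + b - m * np.log(b)

def fisher_asimov(mu_p, b_p, s, eps=1e-4):
    """Fisher information via Eq. 8.3.9: Hessian of the NLL on the Asimov dataset."""
    n_A, m_A = mu_p * s + b_p, b_p                 # Asimov observations (Eqs. 8.3.25-8.3.26)
    theta0 = np.array([mu_p, b_p])
    I = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            ei, ej = np.eye(2)[i] * eps, np.eye(2)[j] * eps
            # mixed central finite difference for d^2 l / d theta_i d theta_j
            I[i, j] = (nll(theta0 + ei + ej, n_A, m_A, s)
                       - nll(theta0 + ei - ej, n_A, m_A, s)
                       - nll(theta0 - ei + ej, n_A, m_A, s)
                       + nll(theta0 - ei - ej, n_A, m_A, s)) / (4 * eps**2)
    return I

# Placeholder values; should reproduce the analytic matrix of Eq. 8.3.4.
print(fisher_asimov(mu_p=1.0, b_p=50.0, s=50.0))
```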

8.3.2 Asymptotic form of the profile likelihood ratio

We can now proceed to derive the asymptotic form of the sampling distribution $p(t_\mu|\mu')$ of the profile likelihood ratio test statistic $t_\mu$, under a "true" signal strength of $\mu'$. This asymptotic form is extremely useful for simplifying the computation of (expected) significances, limits, and intervals; indeed, standard procedure at the LHC is to use it in lieu of toy-based, empirical distributions for $p(t_\mu|\mu')$.

Asymptotic form of the profile likelihood ratio

We start by deriving the asymptotic form of the profile likelihood ratio test statistic $t_\mu$ (Eq. 8.2.7), following a similar procedure to Section 8.3.1 (and using the results therein) of Taylor expanding around its minimum at $\hat{\mu}$:10

\begin{align}
t_\mu &= -2 \ln \lambda(\mu) \tag{8.3.10} \\
&= 2\, l(\mu, \hat{\hat{b}}(\mu)) - 2\, l(\hat{\mu}, \hat{b}) \tag{8.3.11} \\
&\approx \underbrace{2\, l(\hat{\mu}, \hat{\hat{b}}(\hat{\mu})) - 2\, l(\hat{\mu}, \hat{b})}_{\hat{\hat{b}}(\hat{\mu}) = \hat{b}\text{, so this is }0} + \underbrace{2\, l'(\hat{\mu}, \hat{\hat{b}}(\hat{\mu}))\,(\mu - \hat{\mu})}_{l'(\hat{\mu}, \hat{b}) = 0} + 2\, l''(\hat{\mu}, \hat{\hat{b}}(\hat{\mu}))\, \frac{(\mu - \hat{\mu})^2}{2} \tag{8.3.12} \\
&= l''(\hat{\mu}, \hat{b})\,(\mu - \hat{\mu})^2 \tag{8.3.13} \\
&\approx \underbrace{\mathbb{E}[l''(\hat{\mu}, \hat{b})]}_{\text{by law of large numbers}}\,(\mu - \hat{\mu})^2 \tag{8.3.14} \\
&\approx \underbrace{\mathbb{E}[l''(\mu', b')]}_{\text{since bias of MLEs}\,\to\,0}\,(\mu - \hat{\mu})^2 \tag{8.3.15} \\
&= \underbrace{I_{\mu\mu}(\mu', b')}_{\text{from definition of Fisher information}}\,(\mu - \hat{\mu})^2 \tag{8.3.16} \\
\Rightarrow\; t_\mu &\approx \underbrace{\frac{(\mu - \hat{\mu})^2}{\sigma_{\hat{\mu}}^2}}_{\text{using }\sigma_{\hat{\mu}}\,\approx\,\sqrt{I^{-1}_{\mu\mu}(\mu', b')}} +\; O\big((\mu - \hat{\mu})^3\big) + O\big(1/\sqrt{N}\big). \tag{8.3.17}
\end{align}

Here, just like in Eq. 8.3.6, we use the law of large numbers in Line 8.3.14 and take $l''(\hat{\mu}, \hat{b})$ to asymptotically equal its expectation value under the true parameter values $\mu', b'$: $l''(\hat{\mu}, \hat{b}) \xrightarrow{N \gg 1} \mathbb{E}[l''(\hat{\mu}, \hat{b})]$. We then in Line 8.3.15 also use the fact that MLEs are generally unbiased estimators of the true parameter values in the large sample limit to say $\mathbb{E}[l''(\hat{\mu}, \hat{b})] \xrightarrow{N \gg 1} \mathbb{E}[l''(\mu', b')]$. Finally, in the last step, we use the asymptotic form of the MLE (Eq. 8.3.7).

Asymptotic form of p(tμ|μ)

Now that we have an expression for $t_\mu$, we can consider its sampling distribution. With a simple change of variables, the form of $p(t_\mu|\mu')$ should hopefully be evident: recognizing that μ and $\sigma_{\hat{\mu}}^2$ are simply constants, while $\hat{\mu}$ we know is distributed as a Gaussian centered around $\mu'$ with variance $\sigma_{\hat{\mu}}^2$, let's define $\gamma \equiv \frac{\mu - \hat{\mu}}{\sigma_{\hat{\mu}}}$, so that

\begin{align}
t_\mu &\approx \frac{(\mu - \hat{\mu})^2}{\sigma_{\hat{\mu}}^2} = \gamma^2, \tag{8.3.18} \\
\gamma &\sim \mathcal{N}\left(\frac{\mu - \mu'}{\sigma_{\hat{\mu}}},\; 1\right). \tag{8.3.19}
\end{align}

For the special case of $\mu' = \mu$, we can see that $t_\mu$ is simply the square of a standard normal random variable, which is the definition of the well-known $\chi^2_k$ distribution with k = 1 degree of freedom (DoF):

$$p(t_\mu\,|\,\mu) \sim \chi^2_1.$$
(8.3.20)

In the general case where $\mu'$ may differ from μ, $t_\mu$ is the square of a random variable with unit variance but non-zero mean. This is distributed as the similar, but perhaps less well-known, non-central chi-squared distribution $\chi^2_k(\Lambda)$, again with 1 DoF, and with a "non-centrality parameter"

\begin{align}
\Lambda &= \bar{\gamma}^2 = \left(\frac{\mu - \mu'}{\sigma_{\hat{\mu}}}\right)^2, \tag{8.3.21} \\
p(t_\mu\,|\,\mu') &\sim \chi^2_1(\Lambda). \tag{8.3.22}
\end{align}

The "central" vs. non-central chi-squared distributions are visualized in Figure 8.25 for k = 1. We can see that $\chi^2_k(\Lambda)$ simply shifts towards the right as Λ increases (at Λ = 0 it is a regular central $\chi^2$). As $\Lambda \to \infty$, $\chi^2_k(\Lambda)$ becomes more and more like a normal distribution with mean Λ.11
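These distributions are available directly in scipy as ncx2. Here is a minimal sketch (with illustrative Λ values) that prints their means and standard deviations, showing the shift of the distribution to larger values, centered near k + Λ ≈ Λ, as Λ grows:

```python
import numpy as np
from scipy.stats import ncx2

# Non-central chi-squared with k = 1 DoF: as Lambda grows, the distribution
# shifts right (mean k + Lambda, variance 2(k + 2*Lambda)) and looks increasingly Gaussian.
for Lam in (1, 10, 100):                     # illustrative non-centrality parameters
    mean, var = ncx2.stats(df=1, nc=Lam, moments="mv")
    print(f"Lambda = {Lam:3d}:  mean = {float(mean):6.1f},  std = {float(np.sqrt(var)):5.1f}")
```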


Figure 8.25. Central $\chi^2_k$ and non-central $\chi^2_k(\Lambda)$ distributions for Λ between 1 and 30 (left) and 30 and 300 (right).

By extending the derivation in Eq. 8.3.17 to multiple POIs $\boldsymbol{\mu}$, one can find the simple generalization:

$$p(t_{\boldsymbol{\mu}}\,|\,\boldsymbol{\mu}') \sim \chi^2_k(\Lambda),$$
(8.3.23)

where the number of DoF k equals the number of POIs, $\dim \boldsymbol{\mu}$, and

$$\Lambda = (\boldsymbol{\mu} - \boldsymbol{\mu}')^T\, \tilde{I}^{-1}(\boldsymbol{\mu}')\, (\boldsymbol{\mu} - \boldsymbol{\mu}'),$$
(8.3.24)

where $\tilde{I}^{-1}$ is the inverse of the covariance matrix of the POIs, i.e., the inverse of $I^{-1}$ restricted only to the components corresponding to the POIs.

Estimating σμ^2

The critical remaining step to understanding the asymptotic distribution of tμ is estimating σμ^2 to find the non-centrality parameter Λ in Eq. 8.3.21. We now discuss two methods to do this.

Method 1: Inverting the Fisher information / covariance matrix The first method is simply using $\sigma_{\hat{\mu}} \approx \sqrt{I^{-1}_{\mu\mu}(\mu', b')}$ as in Section 8.3.1.12 This is shown in Figure 8.26 for our counting experiment, using the analytic form for $\sigma_{\hat{\mu}}$ from Eq. 8.3.8. We can see that this asymptotic approximation agrees well with the true distribution for some range of parameters, but can deviate significantly for others, as highlighted especially in the right plot.


Figure 8.26. Comparing the distribution $p(t_\mu|\mu')$ (solid) with non-central $\chi^2_1(\Lambda)$ distributions (dotted) for a range of s, b, μ, $\mu'$ values, with $\sigma_{\hat{\mu}}^2$ estimated using the inverse of the Fisher information matrix.

Interlude on Asimov dataset While we are able to find the analytic form for $I^{-1}_{\mu\mu}(\mu', b')$ easily for our simple counting experiment, in general it has to be calculated numerically. As introduced in Section 8.3.1, to handle the expectation value under $\mu', b'$ in Eq. 8.3.1, we can make use of the Asimov dataset, where the observations $n_A$, $m_A$ are taken to be their expectation values under $\mu', b'$, simplifying the calculation of I to Eq. 8.3.9.

Explicitly, for our counting experiment (Eq. 8.2.3), the Asimov observations are simply

\begin{align}
n_A &= \mathbb{E}[n] = \mu' s + b', \tag{8.3.25} \\
m_A &= \mathbb{E}[m] = b'. \tag{8.3.26}
\end{align}

We’ll now consider a second powerful use of the Asimov dataset to estimate σμ^2.

Method 2: The "Asimov sigma" estimate Putting together Eqs. 8.2.10 and 8.3.26, we can derive a nice property of the Asimov dataset: the MLEs $\hat{\mu}$, $\hat{b}$ equal the true values $\mu'$, $b'$:

\begin{align}
\hat{b} &= m_A = b' \tag{8.3.27} \\
\hat{\mu} &= \frac{n_A - m_A}{s} = \frac{\mu' s + b' - b'}{s} = \mu'. \tag{8.3.28}
\end{align}

Thus, tμ evaluated for the Asimov dataset is exactly the non-centrality parameter Λ that we are after!

$$t_{\mu,A} \approx \left(\frac{\mu - \hat{\mu}}{\sigma_{\hat{\mu}}}\right)^2 = \left(\frac{\mu - \mu'}{\sigma_{\hat{\mu}}}\right)^2 = \Lambda.$$
(8.3.29)

While not strictly necessary to obtain the asymptotic form for $p(t_\mu|\mu')$, we can also invert this to estimate $\sigma_{\hat{\mu}}$, as

$$\sigma_A \equiv \sqrt{\frac{(\mu - \mu')^2}{t_{\mu,A}}},$$
(8.3.30)

where σA is known as the “Asimov sigma”.
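A minimal sketch of this Asimov procedure for the counting model (assuming numpy/scipy; the tested μ and the true μ′, b′, s values are placeholders): build the Asimov observations, profile b for the tested μ with a bounded scalar minimizer, and read off Λ = t_{μ,A} and σ_A.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(mu, b, n, m, s):
    """NLL of the counting model (Eq. 8.2.5), up to constants."""
    return (mu * s + b) - n * np.log(mu * s + b) + b - m * np.log(b)

def t_mu_asimov(mu, mu_p, b_p, s):
    """t_mu evaluated on the Asimov dataset (n_A, m_A) = (mu'*s + b', b')."""
    n_A, m_A = mu_p * s + b_p, b_p
    # profile b for the tested mu; on Asimov data the global minimum sits at (mu', b')
    prof = minimize_scalar(lambda b: nll(mu, b, n_A, m_A, s),
                           bounds=(1e-6, 10 * (n_A + m_A)), method="bounded")
    return 2 * (prof.fun - nll(mu_p, b_p, n_A, m_A, s))

mu, mu_p, b_p, s = 2.0, 1.0, 20.0, 20.0      # placeholder values
Lam = t_mu_asimov(mu, mu_p, b_p, s)          # non-centrality parameter (Eq. 8.3.29)
sigma_A = np.sqrt((mu - mu_p) ** 2 / Lam)    # "Asimov sigma" (Eq. 8.3.30)
print(Lam, sigma_A)
```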

The asymptotic distributions using $\Lambda = t_{\mu,A}$ are plotted in Figure 8.27. We see that this estimate matches the sampling distributions very well, even for cases where the covariance-matrix estimate failed! Indeed, this is why estimating $\sigma_{\hat{\mu}} \approx \sigma_A$ is the standard in LHC analyses, and it is the method we'll employ going forward.

Reference [301] conjectures that this is because the Fisher-information approach is restricted to estimating only the second-order term of Eq. 8.3.17, while with $t_{\mu,A}$ we're matching the shape of the likelihood at the minimum, which may be able to capture some of the higher-order terms as well.


Figure 8.27. Comparing the sampling distribution $p(t_\mu|\mu')$ with non-central $\chi^2_1(\Lambda)$ distributions for a range of s, b, μ, $\mu'$ values, with the Asimov sigma estimation for $\sigma_{\hat{\mu}}^2$.

Despite the pervasive use of the asymptotic formulae at the LHC, it's important to remember that they are an approximation, only valid for large statistics. Figure 8.28 below shows the approximation breaking down for $s, b \lesssim 10$.


Figure 8.28. Comparing the sampling distribution $p(t_\mu|\mu')$ with non-central $\chi^2_1(\Lambda)$ distributions for different $s, b \lesssim 10$, showing the breakdown of the $\sigma_A$ approximation for $\sigma_{\hat{\mu}}^2$ at low statistics.
The PDF and CDF

The probability density function (PDF) of a $\chi^2_k(\Lambda)$ distribution can be found in e.g. Ref. [309]; for k = 1:

$$p(t_\mu\,|\,\mu') \sim \chi^2_1(\Lambda) = \frac{1}{2\sqrt{t_\mu}}\left(\varphi\!\left(\sqrt{t_\mu} - \sqrt{\Lambda}\right) + \varphi\!\left(\sqrt{t_\mu} + \sqrt{\Lambda}\right)\right),$$
(8.3.31)

where φ is the PDF of a standard normal distribution. For $\mu' = \mu$ (i.e., Λ = 0), this simplifies to:

$$p(t_\mu\,|\,\mu) \sim \chi^2_1 = \frac{1}{\sqrt{t_\mu}}\,\varphi\!\left(\sqrt{t_\mu}\right).$$
(8.3.32)

The cumulative distribution function (CDF) for k = 1 is:

$$F(t_\mu\,|\,\mu') = \Phi\!\left(\sqrt{t_\mu} - \sqrt{\Lambda}\right) + \Phi\!\left(\sqrt{t_\mu} + \sqrt{\Lambda}\right) - 1,$$
(8.3.33)

where Φ is the CDF of the standard normal distribution. For $\mu' = \mu$ (i.e., Λ = 0), again this simplifies to:

$$F(t_\mu\,|\,\mu) = 2\,\Phi\!\left(\sqrt{t_\mu}\right) - 1.$$
(8.3.34)
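As a sanity check, these closed forms can be compared against scipy's non-central chi-squared; a minimal sketch with an illustrative Λ value:

```python
import numpy as np
from scipy.stats import ncx2, norm

# Closed-form k = 1 PDF (Eq. 8.3.31) and CDF (Eq. 8.3.33) vs. scipy's ncx2;
# the Lambda value is illustrative.
t, Lam = np.linspace(0.05, 30, 500), 4.0
pdf_eq = (norm.pdf(np.sqrt(t) - np.sqrt(Lam)) + norm.pdf(np.sqrt(t) + np.sqrt(Lam))) / (2 * np.sqrt(t))
cdf_eq = norm.cdf(np.sqrt(t) - np.sqrt(Lam)) + norm.cdf(np.sqrt(t) + np.sqrt(Lam)) - 1
print(np.allclose(pdf_eq, ncx2.pdf(t, df=1, nc=Lam)),
      np.allclose(cdf_eq, ncx2.cdf(t, df=1, nc=Lam)))   # both comparisons should agree
```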

From Eq. 8.2.14, we know the p-value pμ of the observed tμobs under a signal hypothesis of Hμ is

$$p_\mu = 1 - F(t_\mu^{\mathrm{obs}}\,|\,\mu) = 2\left(1 - \Phi\!\left(\sqrt{t_\mu^{\mathrm{obs}}}\right)\right),$$
(8.3.35)

with an associated significance

$$Z = \Phi^{-1}(1 - p_\mu) = \Phi^{-1}\!\left(2\,\Phi\!\left(\sqrt{t_\mu^{\mathrm{obs}}}\right) - 1\right).$$
(8.3.36)

Application to hypothesis testing

Let's see how well this approximation agrees with the toy-based p-value we found in Example 8.2.1. For the same counting experiment example, where we expect s = 10 and observe $n_{\mathrm{obs}}$ = 20, $m_{\mathrm{obs}}$ = 5, we found the p-value for testing the μ = 1 hypothesis to be $p_{\mu=1}$ = 0.3 (with an associated significance Z = 0.52). Calculating $t_\mu^{\mathrm{obs}}$ for this example and plugging it into the asymptotic approximation from Eq. 8.3.35 gives:13

\begin{align}
t_\mu^{\mathrm{obs}} &= 1.08 \tag{8.3.37} \\
p_{\mu=1} &= 2\left(1 - \Phi\!\left(\sqrt{1.08}\right)\right) = 0.3 \tag{8.3.38} \\
Z &= 0.52. \tag{8.3.39}
\end{align}

We see that it agrees exactly!
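For reference, here is a minimal sketch (assuming numpy/scipy) of the computation behind these numbers: profile b numerically for the tested μ = 1 with the observed counts, form $t_\mu^{\mathrm{obs}}$, and apply Eqs. 8.3.35–8.3.36; up to rounding it should reproduce $t_\mu^{\mathrm{obs}} \approx 1.1$, p ≈ 0.3, and Z ≈ 0.5.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def nll(mu, b, n, m, s):
    """NLL of the counting model (Eq. 8.2.5), up to constants."""
    return (mu * s + b) - n * np.log(mu * s + b) + b - m * np.log(b)

s, n_obs, m_obs, mu_test = 10, 20, 5, 1.0
mu_hat, b_hat = (n_obs - m_obs) / s, m_obs                      # unconditional MLEs
prof = minimize_scalar(lambda b: nll(mu_test, b, n_obs, m_obs, s),
                       bounds=(1e-6, 10 * (n_obs + m_obs)), method="bounded")
t_obs = 2 * (prof.fun - nll(mu_hat, b_hat, n_obs, m_obs, s))    # observed t_mu
p = 2 * (1 - norm.cdf(np.sqrt(t_obs)))                          # Eq. 8.3.35
Z = norm.ppf(1 - p)                                             # Eq. 8.3.36
print(f"t_mu = {t_obs:.2f}, p = {p:.2f}, Z = {Z:.2f}")
```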

The agreement more generally, with varying s,μ,nobs,mobs, is plotted in Figure 8.29. We observe generally strong agreement, except for low n,m where, as expected, the asymptotic approximation breaks down.

PIC

Figure 8.29. Comparing the significances, as a function of the signal strength μ of the hypothesis being tested, for simple counting experiments (Eq. 8.2.3) with different s, $n_{\mathrm{obs}}$, $m_{\mathrm{obs}}$, derived using 30,000 toys each (solid) to estimate the $p(t_\mu|\mu)$ distribution vs. the asymptotic approximation (dashed).
Summary

We have been able to find the asymptotic form for the profile-likelihood-ratio test statistic $t_\mu \approx (\mu - \hat{\mu})^2 / \sigma_{\hat{\mu}}^2$, which is distributed as a non-central chi-squared ($\chi^2_k(\Lambda)$) distribution. We discussed two methods for finding the non-centrality parameter Λ, of which the Asimov sigma $\sigma_A$ estimate generally performed better. Finally, the asymptotic formulae were applied to simple examples of hypothesis testing to check the agreement with toy-based significances. These asymptotic formulae can be extended to the alternative test statistic for positive signals $\tilde{t}_\mu$ and the upper-limit-setting $\tilde{q}_\mu$, as in Ref. [301], to simplify the calculation of both observed and expected significances, limits, and intervals.

With that, we conclude the overview of the statistical interpretation of LHC results. We will see practical applications of these concepts to searches in the high energy Higgs sector in Part V.

5See derivations in e.g. Ref. [305].

6The reason for this is discussed shortly below.

7For a more rigorous derivation, see e.g. Ref. [308].

8We are able to do this because, as we saw above, the score is linear in n for Poisson likelihoods.

9The Asimov dataset is named after Isaac Asimov, the popular science fiction author, whose book Franchise is about a supercomputer choosing a single person as the sole voter in the U.S. elections, because they can represent the entire population.

10Note: this is not a rigorous derivation; it’s just a way to motivate the final result, which is taken from Ref. [301]. (If you know of a better way, let me know!)

11More information can be found in e.g. Ref. [309].

12More generally, we'd need $\tilde{I}^{-1}$ for Eq. 8.3.24.

13Note that we're using $t_\mu$ here, not the alternative test statistic $\tilde{t}_\mu$; however, in this case, since $\hat{\mu} > 0$, they are equivalent.