
8.2 Frequentist statistics at the LHC

8.2.1 The likelihood function and test statistics

The data model

Let us take the simplest possible case of a (one bin) counting experiment, where in our “signal region” we expect s signal events and b background events. The probability to observe n events in our signal region is distributed as a Poisson with mean s + b:

P(n; s, b) = \mathrm{Pois}(n; s + b) = \frac{(s + b)^n\, e^{-(s + b)}}{n!}
(8.2.1)

Since we have only one observation but two free parameters, this experiment is underconstrained. So, let’s also add a “control region” where we expect no signal and b background events. The probability of observing m events in our control region is therefore:

\mathrm{Pois}(m; b) = \frac{b^m\, e^{-b}}{m!}
(8.2.2)

Combining the two, the joint probability distribution for n and m is:

P(n, m; s, b) = \mathrm{Pois}(n; s + b)\,\mathrm{Pois}(m; b) = \frac{(s + b)^n\, e^{-(s + b)}}{n!} \cdot \frac{b^m\, e^{-b}}{m!}
(8.2.3)

This is also called the model for the data and is plotted for sample s,b values in Figure 8.1.


Figure 8.1. Sample 2D Poisson distributions.
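
For concreteness, this model is simple to evaluate numerically. Below is a minimal sketch using numpy and scipy (the grid ranges and the sample values of s and b are arbitrary choices for illustration):

    import numpy as np
    from scipy.stats import poisson

    def model_pmf(n, m, s, b):
        # joint probability P(n, m; s, b) = Pois(n; s + b) * Pois(m; b)
        return poisson.pmf(n, s + b) * poisson.pmf(m, b)

    # evaluate on a grid of (n, m) for sample parameter values, as in Figure 8.1
    s, b = 10, 5
    n_vals = np.arange(40)[:, None]
    m_vals = np.arange(25)[None, :]
    grid = model_pmf(n_vals, m_vals, s, b)
    print(grid.sum())  # ~1, up to the truncation of the grid
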
The likelihood function

In the frequentist philosophy, however, our parameters s, b, etc. are simply fixed values of nature and, hence, don’t have a probability distribution. Instead, we work with the likelihood function, which is a function only of our parameters of interest (POIs), s in our example, and “nuisance parameters” (b), given fixed values for n and m:

L(s, b) = P(n, m; s, b) = \frac{(s + b)^n\, e^{-(s + b)}}{n!} \cdot \frac{b^m\, e^{-b}}{m!}.
(8.2.4)

Importantly, this is not a probability distribution on s and b! To derive that, we would have to use Bayes’ rule to go from P(n,m; s,b) to P(s,b; n,m); however, such probability distributions don’t make sense in our frequentist world view, so we’re stuck with this likelihood formulation. Often, it’s more convenient to consider the negative log-likelihood:

-\ln L = \ln n! + \ln m! + s + 2b - n \ln(s + b) - m \ln b
(8.2.5)
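
In code, this is a one-liner; a sketch (the function name nll is our own), using scipy’s gammaln for the ln n! terms and xlogy to handle edge cases like m = b = 0 gracefully:

    import numpy as np
    from scipy.special import gammaln, xlogy

    def nll(s, b, n, m):
        # -ln L = ln n! + ln m! + s + 2b - n ln(s + b) - m ln(b),
        # with xlogy(k, x) = k * log(x) and the convention xlogy(0, 0) = 0
        return gammaln(n + 1) + gammaln(m + 1) + s + 2 * b - xlogy(n, s + b) - xlogy(m, b)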

The profile likelihood ratio

Fundamentally, the goal of any experiment is to test the compatibility of the observed data (n,m here) with a certain hypothesis H. We do this by mapping the data to a “test statistic” t, which is just a number, and comparing it against its distribution under H, P(t|H). Our problem, thus, boils down to 1) choosing the most effective t for testing H, and 2) obtaining P(t|H).

In the case of testing a particular signal strength, we use the “profile likelihood ratio”:

\lambda(s) = \frac{L(s, \hat{\hat{b}}(s))}{L(\hat{s}, \hat{b})},
(8.2.6)

where ŝ, b^ are the maximum-likelihood estimates (MLEs) for s and b, given the observations n,m, and b^^(s) is the MLE for b given n, m, and s. The MLE for a parameter is simply the value for which the likelihood is maximized, and will be discussed in the next section. The numerator of λ(s) can be thought of as a way to “marginalize” over the nuisance parameters by simply using the values that maximize the likelihood for any given s, while the denominator is effectively a normalization factor, such that λ(s) ≤ 1.

Again, it’s often more convenient to use the (negative) logarithm:

t_s = -2 \ln \lambda(s)
(8.2.7)

Note that Max[λ(s)] = 1 corresponds to Min[ts] = 0. λ(s) and ts are plotted for sample n,m values with n − m = 10 in Figure 8.2. The maximum (minimum) of the profile likelihood ratio (ts) is at s = n − m = 10, as we expect; however, as the ratio between n and m decreases, i.e., the experiment becomes more noisy, the distributions broaden, representing the reduction in sensitivity, or the higher uncertainty on the true value of s.


Figure 8.2. The profile likelihood ratio λ(s) (left) and the ts test statistic (right) for our one-bin Poisson model.

Note that the likelihood ratio and ts are also broadened due to the nuisance parameter; i.e., because we are missing information about b. This can be demonstrated by plotting them with b fixed to m, emulating perfect information about b (Figure 8.3), and indeed, we see the functions are narrower than in Figure 8.2. More generally, increasing (decreasing) the uncertainties on the nuisance parameters will broaden (narrow) the test statistic distribution. This is why, experimentally, we want to constrain them through auxiliary measurements as much as possible.


Figure 8.3. The profile likelihood ratio λ(s) (left) and the ts test statistic (right) with b = m, demonstrating the effect of decreasing uncertainties on our nuisance parameters.
Maximum-likelihood estimates

MLEs for s and b can be found for this example by setting the derivative of the negative log-likelihood to 0 (more generally, this would require numerical minimization):

\frac{\partial(-\ln L)}{\partial s} = 1 - \frac{n}{s + b} = 0
(8.2.8)
\frac{\partial(-\ln L)}{\partial b} = 2 - \frac{n}{s + b} - \frac{m}{b} = 0
(8.2.9)

Solving simultaneously yields, as you might expect:

\hat{b} = m, \quad \hat{s} = n - m,
(8.2.10)

Or, solving Eq. 8.2.9 just for b^^(s) at fixed s:

2b^2 + (2s - n - m)\,b - ms = 0
(8.2.11)

Plugging this back in, we can get λ(s) and ts for any given s.
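
Putting this together, here is a small sketch (reusing the nll function above; the helper names b_hathat and t_s are ours) that profiles b analytically via the quadratic and evaluates the test statistic:

    import numpy as np

    def b_hathat(s, n, m):
        # conditional MLE of b at fixed s: positive root of 2b^2 + (2s - n - m)b - ms = 0
        B = 2 * s - n - m
        return (-B + np.sqrt(B**2 + 8 * m * s)) / 4

    def t_s(s, n, m):
        # t_s = -2 ln lambda(s) = 2 [ -ln L(s, b_hathat(s)) + ln L(s_hat, b_hat) ]
        s_hat, b_hat = n - m, m  # unconditional MLEs
        return 2 * (nll(s, b_hathat(s, n, m), n, m) - nll(s_hat, b_hat, n, m))

    print(t_s(10.0, 20, 5))  # ~1.07, matching Example 8.2.1 below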

Alternative test statistic

So far, our construction allows for s < 0; however, physically the number of signal events can’t be negative. Rather than incorporating this constraint in the model, it’s more convenient to impose this in the test statistic, by defining:

\tilde{\lambda}(s) = \begin{cases} \dfrac{L(s, \hat{\hat{b}}(s))}{L(\hat{s}, \hat{b})}, & \hat{s} \geq 0 \\[6pt] \dfrac{L(s, \hat{\hat{b}}(s))}{L(0, \hat{\hat{b}}(0))}, & \hat{s} < 0 \end{cases},
(8.2.12)

and

\tilde{t}_s = -2 \ln \tilde{\lambda}(s)
(8.2.13)

The difference between the nominal and alternative test statistics is highlighted in Figure 8.4. For n < m, the λ~(s) = 1 and t~s = 0 values are at s = 0, since physically that is what fits best with our data (even though the math says otherwise).
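
The same thing in code, as a sketch building on the t_s and b_hathat helpers above:

    def t_s_tilde(s, n, m):
        # alternative test statistic: if the best-fit signal is negative,
        # compare against s = 0 instead of s_hat (Eq. 8.2.12)
        if n - m >= 0:
            return t_s(s, n, m)
        return 2 * (nll(s, b_hathat(s, n, m), n, m) - nll(0.0, b_hathat(0.0, n, m), n, m))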


Figure 8.4. Comparing the nominal vs alternative test statistic.

Next, we want to translate this to a probability distribution of t~s under a particular signal hypothesis (Hs) (i.e., an assumed value of s): p(t~s|Hs), or just p(t~s|s) for simplicity.

8.2.2 Hypothesis testing

The goal of any experiment is to test whether our data support or exclude a particular hypothesis H, and quantify the (dis)agreement. For example, to what degree did our search for the Higgs boson agree or disagree with the standard model hypothesis?

We have already discussed the process of mapping data to a scalar test statistic t that we can use to test H. However, we need to know the probability distribution of t under H to quantify the (in)consistency of the observed data with H and decide whether or not to exclude H.

We must also recognize that there’s always a chance that we will exclude H even if it’s true (called a Type I error, or a false positive), or not exclude H when it’s false (Type II error, or false negative). The probability of each is referred to as α and β, respectively. This is summarized handily in Table 8.1.

Table 8.1. Table of error types, reproduced from Ref. [65].
--------------------------------------------------------------------------------
                                      |        Null hypothesis (H0) is
       Table of error types           |------------------------|-----------------------
                                      |          True          |         False
--------------------------------------|------------------------|-----------------------
                   | Fail to reject   |   Correct inference    |    Type II error
  Decision         |                  |    (true negative)     |   (false negative)
  about null       |                  | (probability = 1 − α)  |  (probability = β)
  hypothesis (H0)  |------------------|------------------------|-----------------------
                   | Reject           |     Type I error       |   Correct inference
                   |                  |   (false positive)     |    (true positive)
                   |                  |  (probability = α)     | (probability = 1 − β)
--------------------------------------------------------------------------------

Before the test, we should decide on a probability of making a Type I error, α, that we are comfortable with, called a “significance level”. Typical values are 5% and 1%, although if we’re claiming something crazy like a new particle, we better be very sure this isn’t a false positive; hence, we set a much lower significance level for these tests of 3 × 10⁻⁷. (The significance of this value will be explained in Section 8.2.2. below.)

Deriving p(t~s|s)

We can approximate p(t~s|s) by generating a large number of pseudo- or “toy” datasets assuming s expected signal events. In this case, this means sampling possible values for n,m from our probability model. We will continue with our simple counting experiment (Section 8.2.1.), for which such toy datasets are generated and then used to create histograms for p(t~s|s) in Figure 8.5. Note that one complication in generating these toys is that the n,m distributions from which we want to sample (Eq. 8.2.3) also depend on the nuisance parameter b. However, we see from the figure that this does not matter as much as we might expect.
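
A sketch of the toy procedure (reusing t_s_tilde from above; b_gen is whatever value of the nuisance parameter we assume when generating the toys):

    import numpy as np

    rng = np.random.default_rng(0)

    def toy_ts(s, b_gen, n_toys=10_000):
        # sample pseudo-data (n, m) from the model with signal s and assumed
        # background b_gen, and compute the test statistic for each toy
        n = rng.poisson(s + b_gen, size=n_toys)
        m = rng.poisson(b_gen, size=n_toys)
        return np.array([t_s_tilde(s, ni, mi) for ni, mi in zip(n, m)])

    toys = toy_ts(s=10, b_gen=5)
    # a histogram of toys approximates p(t~s|s), cf. Figure 8.5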


Figure 8.5. Estimating p(t~s|s) through toys.

We make two important observations:

1.
p(t~s|s) does not depend on nuisance parameters as long as we have sufficiently large statistics (in this case, when b is sufficiently large). This is a key reason for basing our test statistic on the profile likelihood.
2.
In fact, p(t~s|s) doesn’t even depend on the POI s! (Again, as long as s is large.)

Reference [301] shows that, asymptotically, this distribution follows a χ2 distribution with degrees of freedom equal to the number of POIs, as illustrated in Figure 8.6.¹ We can see that the asymptotic form looks accurate even for s,b as low as 5. Note that for cases where we can’t use the asymptotic form, Ref. [58] recommends using b = b^^(s) when generating toys, so that we (approximately) maximize the agreement with the hypothesis.
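
In code, using the asymptotic approximation simply means replacing the toy histogram with scipy’s χ2 distribution (df = 1 for our single POI); e.g., tail probabilities from the toys above should roughly match the χ2 survival function:

    from scipy.stats import chi2

    print((toys > 3.84).mean())   # toy-based tail probability
    print(chi2.sf(3.84, df=1))    # asymptotic value, ~0.05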


Figure 8.6. Asymptotic form of p(t~s|s).
p-values and significance

Now that we know the distribution of the test statistic p(t~s|Hs) ≡ p(t~s|s), we can finally test Hs with our experiment. We just need to calculate the “observed” test statistic t~sobs from our observations, and compare it to the distribution p(t~s|s).

Example 8.2.1. Let’s say we’re testing the hypothesis of s = 10 signal events in our model and we observe n = 20, m = 5 events. We can map this observation to our test statistic t~sobs(s = 10, nobs = 20, mobs = 5) = 1.07, and see where this falls in our p(t~s|s) distribution (Figure 8.7).


Figure 8.7. Testing Hs in Example 8.2.1.

Ultimately, given p(t~s|s), we care about the probability of obtaining t~sobs or a value even more inconsistent with Hs; i.e., the green shaded region in Figure 8.7. This is referred to as the p-value of the observation:

p_s = \int_{\tilde{t}_{\mathrm{obs}}}^{\infty} p(\tilde{t}_s | s)\, d\tilde{t}_s = 1 - F(\tilde{t}_{\mathrm{obs}} | s),
(8.2.14)

which is 0.30 for this example, where

F(\tilde{t}_s | s) = \int_{-\infty}^{\tilde{t}_s} p(\tilde{t}_s | s)\, d\tilde{t}_s
(8.2.15)

is the cumulative distribution function (CDF) of t~s. We reject the hypothesis if this p-value is less than our chosen significance level α; the idea being that if Hs were true and we repeated this measurement many times, then the probability of a false positive (p-value ≤ α) is exactly α, as we intended.

The p-value is typically converted into a significance (Z), which is the corresponding number of standard deviations away from the mean in a Gaussian distribution:

Z = \Phi^{-1}(1 - p),
(8.2.16)

where Φ is the CDF of the standard Gaussian. This is more easily illustrated in Figure 8.8, where φ is the standard Gaussian distribution:


Figure 8.8. Relationship between significance Z and the p-value, reproduced from Ref. [58].

The significance in Example 8.2.1 is, therefore, Φ⁻¹(1 − 0.30) = 0.53. We sometimes say that our measurement is (in)consistent or (in)compatible with H at the 0.53σ level, or within 1σ, etc.
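
The full chain for Example 8.2.1, as a sketch reusing the helpers and toys from above:

    from scipy.stats import chi2, norm

    t_obs = t_s_tilde(10, 20, 5)       # ~1.07
    p_toys = (toys >= t_obs).mean()    # toy-based p-value, ~0.30
    p_asym = chi2.sf(t_obs, df=1)      # asymptotic p-value, also ~0.30
    Z = norm.ppf(1 - p_asym)           # significance, ~0.5 sigma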

Signal discovery

So far, we have been testing the signal hypothesis, but usually when searching for a particle, we instead test the “background-only” hypothesis H0 and decide whether or not to reject it. This means we want t~0obs and p(t~0|0) (Figure 8.9).²


Figure 8.9. Testing the background-only hypothesis in Example 8.2.1.

We could say for this experiment, therefore, that we exclude the background-only hypothesis at the “3 sigma” level. However, for an actual search for a new particle at the LHC, this is insufficient to claim a discovery, as the probability of a false positive at 3σ, ≈ 1/1000, is too high. The standard is instead set at 5σ for discovering new signals, corresponding to the 3 × 10⁻⁷ significance level quoted earlier, as we really don’t want to be making a mistake if we’re claiming to have discovered a new particle! 3σ, 4σ, and 5σ are commonly referred to as “evidence”, “observation”, and “discovery”, respectively, of the signals we’re searching for.

In summary, the framework for hypothesis testing comprises:

1.
Defining a test statistic t to map data x (in our example, x = (n,m)) to a single number.
2.
Deriving the distribution of t under the hypothesis being tested p(t|H) by sampling from “toy” datasets assuming H.
3.
Quantifying the compatibility of the observed data xobs with H using the p-value or significance Z of tobs relative to p(t|H).

This p-value / significance is what we then use to decide whether or not to exclude H. A particularly important special case of this, as discussed above, is testing the background-only hypothesis when trying to discover a signal.

8.2.3 Confidence intervals and limits

Confidence intervals using the Neyman construction

Next, we discuss going beyond hypothesis testing to setting intervals and limits for parameters of interest. The machinery from Section 8.2.2 can be extended straightforwardly to extracting “confidence intervals” for our parameters of interest (POIs): a range of values of the POIs that are allowed, based on the experiment, at a certain “confidence level” (CL), e.g. 68% or 95%. Very similar to the idea of the significance level, the CL is defined such that if we were to repeat the experiment many times, a 95%-confidence-interval must contain, or cover, the true value of the parameter 95% of the time.

This can be ensured for any given CL by solving Eq. 8.2.14 for a p-value of 1 − CL:

p = 1 - \mathrm{CL} = \int_{\tilde{t}_s^{\mathrm{obs}}}^{\infty} p(\tilde{t}_s | s_{\pm})\, d\tilde{t}_s,
(8.2.17)

where s− and s+ are the lower and upper limits on s, respectively.

This can be solved by scanning s and finding the values of s for which the RHS = 1 − CL, as demonstrated in Figure 8.10 for the experiment in Example 8.2.1 (nobs = 20, mobs = 5). This procedure of inverting the hypothesis test by scanning along the values of the POIs is called the “Neyman construction”.
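
A sketch of this scan, using the asymptotic χ2 p-value for brevity (whereas the figure uses toys) and an arbitrary grid of s values:

    import numpy as np
    from scipy.stats import chi2

    n_obs, m_obs = 20, 5
    scan = np.linspace(0.1, 40, 400)
    pvals = np.array([chi2.sf(t_s_tilde(s, n_obs, m_obs), df=1) for s in scan])
    allowed = scan[pvals > 0.05]         # values of s not excluded at 95% CL
    print(allowed.min(), allowed.max())  # roughly [6, 26], cf. Figure 8.10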


Figure 8.10. Demonstration of the Neyman construction for a 95% confidence interval for the experiment in Example 8.2.1 (nobs = 20, mobs = 5). Left: Scanning p(t~s|s) using 10,000 toys each for different values of s. Right: Converting this to a contour plot of the p-values for different t~s’s as a function of s, with the observed t~sobs in red. The points at which t~sobs intersects with the p-value = 0.05 contour are marked in black and signify the limits of the 95% confidence interval for s: in this case, [6.0, 25.8].

One subtlety to remember is that, in principle, we should also be scanning over the nuisance parameters (b) when estimating the p-values. However, this would be very computationally expensive, so in practice we continue to use b = b^^(s), to always (approximately) maximize the agreement with the hypothesis. Ref. [58] calls this trick “profile construction”.

Upper limits

Typically if a search does not have enough sensitivity to directly observe a new signal, we instead quote an upper limit on the signal strength. This is similar in practice to the Neyman construction for confidence intervals, solving Eq. 8.2.17 only for the upper boundary. However, an important difference is that when setting upper limits, we have to modify the test statistic so that a best-fit signal strength greater than the expected signal (ŝ > s) does not lower the compatibility with Hs:

\tilde{q}(s) = \begin{cases} \tilde{t}(s), & \hat{s} < s \\ 0, & \hat{s} \geq s \end{cases} = \begin{cases} -2 \ln \tilde{\lambda}(s), & \hat{s} < s \\ 0, & \hat{s} \geq s \end{cases} = \begin{cases} -2 \ln \dfrac{L(s, \hat{\hat{b}}(s))}{L(0, \hat{\hat{b}}(0))}, & \hat{s} < 0 \\[6pt] -2 \ln \dfrac{L(s, \hat{\hat{b}}(s))}{L(\hat{s}, \hat{b})}, & 0 \leq \hat{s} < s \\[6pt] 0, & \hat{s} \geq s \end{cases}
(8.2.18)

The upper limit test statistic q~(s) is set to 0 for ŝ > s so that this situation does not contribute to the p-value integral in Eq. 8.2.14. Figure 8.11 demonstrates this, and the difference between t~s and q~s, for different sample observations.
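
In code, again building on the earlier helpers:

    def q_s_tilde(s, n, m):
        # upper-limit test statistic: over-fluctuations (s_hat >= s) are not
        # treated as evidence against the signal hypothesis (Eq. 8.2.18)
        if n - m >= s:
            return 0.0
        return t_s_tilde(s, n, m)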


Figure 8.11. Comparing t~s and q~s.

Note that (as one may expect from Figure 8.11) the distribution p(q~s|s) no longer behaves like a standard χ2 but, instead, as a “half-χ2”. This is essentially half a χ2 plus a delta function at 0 carrying the other half of the probability (since, under the signal hypothesis, on average there will be an over-fluctuation half the time, for which q~s = 0), as shown in Figure 8.12.


Figure 8.12. Comparing p(t~s|s) and p(q~s|s). p(q~s|s) asymptotically follows a half-χ2 distribution (green).

We can now revisit Example 8.2.1 to set an upper limit on s rather than a confidence interval (Figure 8.13). p(q~s|s) is shifted to the left with respect to p(t~s|s); hence, the upper limit of 24 is slightly lower than the upper bound of the 95% confidence interval we derived using t~s.


Figure 8.13. Extending the Neyman construction to an upper limit on s. Left: Scanning the upper limit test statistic distribution p(q~s|s) using 10,000 toys each for different values of s. Right: Converting this to a contour plot of the p-values for different q~s’s as a function of s, with the observed qsobs in red. The point at which qsobs intersects with the p-value = 0.05 contour is marked in black and signifies the upper limit at 95% CL.
The CLs criterion

We now introduce two conventions related to hypothesis testing and searches in particle physics. Firstly (the simple one), the POI s is usually re-parametrized as s → μs, where μ is now considered the POI, referred to as the “signal strength”, and s is a fixed value representing the number of signal events we expect to see for the nominal signal strength μ of 1. For the example in Figure 8.13, if we expect s = 10 signal events, then we would quote the upper limit as 24/s = 2.4 on μ at 95% CL.

The second, important, convention is that we use a slightly different criterion for confidence intervals, called “CLs”. This is motivated by situations where we have little sensitivity to the signal we’re searching for, as in the example below.

Example 8.2.2. Let’s say we expect s = 10 and observe n = 70, m = 100. Really, what this should indicate is that our search is not at all sensitive, since our search region is completely dominated by background and, hence, we should not draw strong conclusions about the signal strength. However, if we follow the above procedure for calculating the upper limit, we get μ ≲ 0.001 at 95% CL.

This is an extremely aggressive limit on μ, where we’re excluding the nominal μ = 1 signal at a high confidence level. Given the complete lack of sensitivity to the signal, this is not a sensible result. The CLs method solves this problem by considering both the p-value of the signal + background hypothesis Hs (referred to as ps+b or just pμ for short), and the p-value of the background-only hypothesis H0 (pb), to define a new criterion:

p'_{\mu} = \frac{p_{\mu}}{1 - p_b}
(8.2.19)

In cases where the signal region is completely background-dominated, the compatibility with the background-only hypothesis should be high, so pb ≈ 1 and, hence, p′μ will be increased. On the other hand, for more sensitive regions, the compatibility should be lower, pb ≈ 0, and p′μ ≈ pμ.

To be explicit, here

p_b = \int_{-\infty}^{\tilde{t}_{\mathrm{obs}}} p(\tilde{t}_s | 0)\, d\tilde{t}_s,
(8.2.20)

where we should note that:

1.
We’re looking at the distribution of t~s (not t~0) under the background-only hypothesis, since the underlying test is of Hs, not H0; and
2.
We’re integrating up to t~obs, unlike for ps, because lower t~ means greater compatibility with the signal hypothesis, i.e., less compatibility with the background-only one.

The effect of the CLs criterion is demonstrated in Figure 8.14 for Examples 8.2.1 and 8.2.2. In the former, the background-only distribution is shifted to the right of the s + b distribution. This indicates that the experiment is sensitive to μ and, indeed, we find p′μ ≈ pμ. In Example 8.2.2, however, the search is not sensitive and, hence, the background-only and s + b distributions almost completely overlap, meaning pb ≈ 1 and p′μ ≫ pμ.³
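
A toy-based sketch of the CLs computation for a hypothesized s, reusing the helpers above and the “profile construction” choice b = b^^(s) for generating toys (real analyses typically use asymptotic formulae instead):

    import numpy as np

    def cls(s, n_obs, m_obs, n_toys=10_000):
        q_obs = q_s_tilde(s, n_obs, m_obs)
        b_gen = b_hathat(s, n_obs, m_obs)  # profile-construction nuisance value
        def toy_qs(s_gen):
            n = rng.poisson(s_gen + b_gen, size=n_toys)
            m = rng.poisson(b_gen, size=n_toys)
            return np.array([q_s_tilde(s, ni, mi) for ni, mi in zip(n, m)])
        p_mu = (toy_qs(s) >= q_obs).mean()             # p-value under H_s
        one_minus_pb = (toy_qs(0.0) >= q_obs).mean()   # 1 - p_b under background-only
        return p_mu / max(one_minus_pb, 1.0 / n_toys)  # guard against dividing by zero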


Figure 8.14. Demonstration of CLs criterion for Examples 8.2.1 (left) and 8.2.2 (right).

Finally, if we repeat the Neyman construction using the CLs criterion p′μ instead of pμ for Example 8.2.2, we can find an upper limit of μ ≲ 1.2 at 95% CL, which is indeed a looser, more conservative, upper limit. The upper limit for Example 8.2.1 remains unchanged at μ ≲ 2.4, as we would expect.

8.2.4 Expected significances and limits

Expected significance

The focus so far has been only on evaluating the results of experiments. However, it is equally important to characterize the expected sensitivity of the experiment before running it (or before looking at the data).

Example 8.2.3. Concretely, we continue with the simple one-bin counting experiment (Section 8.2.1.). Let’s say we expect b = 10 background events and, at the nominal signal strength μ = 1, s = 10 signal events. How do we tell if this experiment is at all useful for discovering this signal, i.e., does it have any sensitivity to the signal?

One way is to calculate the significance with which we expect to exclude the background-only hypothesis if the signal were, in fact, to exist. Practically, this means we are testing H0 and, hence, need p(t~0|μ = 0) as before. However, now we also need the distribution of the test statistic t~0 under the background + signal hypothesis p(t~0|μ = 1). Then, by calculating the significance for each sampled t~0 under Hμ=1, we can estimate the distribution of expected significances. This is illustrated for Example 8.2.3 in Figure 8.15.
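
A toy-based sketch of this calculation, reusing t_s_tilde and the rng from above (and, as noted below, taking the significance of the median t~0 rather than the median of the significances):

    import numpy as np
    from scipy.stats import norm

    def expected_significance(s_sig, b_gen, n_toys=10_000):
        # t~0 under the background-only hypothesis ...
        t0_b = np.array([t_s_tilde(0.0, ni, mi) for ni, mi in
                         zip(rng.poisson(b_gen, n_toys), rng.poisson(b_gen, n_toys))])
        # ... and under the signal + background hypothesis
        t0_sb = np.array([t_s_tilde(0.0, ni, mi) for ni, mi in
                          zip(rng.poisson(s_sig + b_gen, n_toys), rng.poisson(b_gen, n_toys))])
        p = (t0_b >= np.median(t0_sb)).mean()  # p-value of the median t~0 w.r.t. H_0
        return norm.ppf(1 - p)

    print(expected_significance(10, 10))  # ~1.8 for Example 8.2.3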


Figure 8.15. Left: Distributions of t~0 under the background-only and background + signal hypotheses using 30,000 toys each. The median of the latter is marked in red. Right: Distribution of the significances (with respect to the background-only hypothesis) of each sampled t~0 under the signal hypothesis.

Importantly, we usually quote the median of this distribution as the expected significance, since the median is “invariant” to monotonic transformations (i.e., the median p-value will always correspond to the median Z as well, whereas the mean p-value will not correspond to the mean Z). Similarly, we quote the 16% and 84% quantiles as the ±1σ expected significances, and the 2% and 98% quantiles as the ±2σ ones. These quantiles correspond to the cumulative probabilities for a standard Gaussian (Figure 8.16). For Example 8.2.3, we thus find the median expected significance to be 1.83.


Figure 8.16. Gaussian quantiles, reproduced from Ref. [59].

Note that instead of converting each sampled t~ under Hμ=1 into a significance and finding the median of that distribution, as in Figure 8.15 (right), we can take advantage of the invariance of the median and directly use the significance of the median t~ under Hμ=1 (Figure 8.15, left). We will do this below for the expected limit.

Expected limits

The other figure of merit we care about in searches is the upper exclusion limit set on the signal strength. To derive the expected limit, we do the opposite of the above and ask: if the signal were not to exist, what value of μ would we expect to exclude at the 95% CL?⁴

This means we need:

1.
The distribution p(q~μ|μ) to solve for μ+ in Eq. 8.2.17 and be able to do the upper limit calculation (as in Section 8.2.3.);
2.
p(q~μ|0) to get the median (and other quantiles’) expected q~μobs for different signal strengths under the background-only hypothesis; and, furthermore,
3.
To scan over the different signal strengths to find the μ that results in a median p-value of 0.05; or rather, a median p′μ-value (Eq. 8.2.19), since we’re using the CLs method for upper limits (Section 8.2.3.).

First, let’s look at the first two steps for just the μ = 1 signal strength in Example 8.2.3. These steps are similar to, and essentially an inversion of, the procedure for the expected significance: we’re now finding the pμ=1-value with respect to the signal + background hypothesis, for the median q~μ sampled under the background-only hypothesis. This is demonstrated in Figure 8.17.


Figure 8.17. Calculating the median expected pμ=1-value with respect to the signal + background hypothesis, for test statistics q~μ sampled under the background-only hypothesis. p(q~μ|1) and p(q~μ|0) are estimated using 30,000 toys each. Then, the median p(q~μ|0) (red) is used to calculate the pμ-value following the CLs criterion.

The key difference with respect to calculating the expected significance is step 3, in which this procedure has to be repeated for a range of signal strengths to find the value that gives a median (and ±1σ, ±2σ quantile) p′μ of 0.05. This is thus the minimum value of μ that we expect to be able to exclude at 95% CL, as shown in Figure 8.18.
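
A slow but straightforward toy-based sketch of the median expected limit, reusing the helpers above (the scan range and toy counts are arbitrary choices; the ±1σ, ±2σ bands would follow analogously from the other quantiles of the background-only q~μ distribution):

    import numpy as np

    def expected_upper_limit(s_nominal, b_gen, n_toys=2_000, alpha=0.05):
        for mu in np.linspace(0.05, 5.0, 100):
            s = mu * s_nominal
            # q~mu under the background-only hypothesis: its median plays the
            # role of the "observed" test statistic for the expected limit
            q_b = np.array([q_s_tilde(s, ni, mi) for ni, mi in
                            zip(rng.poisson(b_gen, n_toys), rng.poisson(b_gen, n_toys))])
            q_med = np.median(q_b)
            # q~mu under the signal + background hypothesis, for the p_mu-value
            q_sb = np.array([q_s_tilde(s, ni, mi) for ni, mi in
                             zip(rng.poisson(s + b_gen, n_toys), rng.poisson(b_gen, n_toys))])
            p_mu = (q_sb >= q_med).mean()
            one_minus_pb = (q_b >= q_med).mean()  # >= 0.5 by construction
            if p_mu / one_minus_pb < alpha:       # CLs criterion
                return mu                         # smallest mu excluded at 95% CL
        return None

    print(expected_upper_limit(10, 10))  # median expected 95% CLs limit for Example 8.2.3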


Figure 8.18. Left: The expected median and ± 1σ,±2σ quantiles of pμ for different μ’s. The intersection of these with pμ = 0.05 (gray) corresponds to the expected exclusion limits. Right: The median and ± 1σ,±2σ expected limits at 95% CLs on μ.

Thus, we have our expected limits. The right plot of Figure 8.18 is colloquially known as a “Brazil-band plot”, and is the standard way of representing limits. For example, Figure 8.19 is the corresponding plot by ATLAS for the Higgs discovery (scanning over the Higgs mass).


Figure 8.19. Expected and observed 95% CLs upper limits for the SM Higgs by ATLAS in 2012, for different hypothetical Higgs masses [60].

¹ One can find the derivation in the reference therein; essentially, like with most things in physics, this follows from Taylor expanding around the minimum...

² Ref. [301] refers to the special case of the test statistic t~s for s = 0 as q0.

³ Note that, unlike p(t~s|s), p(t~s|0) doesn’t follow a simple χ2; asymptotically, it is closer to a noncentral χ2, as will be discussed in Section 8.3.2.

⁴ 95% is the standard CL for upper limits in HEP.