
8.1 Introduction or: What is an analysis?

Once our data is collected by the CMS detector and reconstructed offline, it is analyzed to search for and measure processes of interest. Typically, the raw data is entirely dominated by irrelevant background processes, which we want to filter out in favor of the signal. The first step towards this is the use of appropriate online triggers, followed by offline selections to isolate the signal. The advent of machine learning, and later deep learning (DL), allows for more sophisticated selections that use increasingly lower-level information, such as individual particles in jets, tracks and clusters, and even detector hits, as we introduced in Chapter 7.

Optimizing the event selection for all but a handful of data-driven searches requires simulations of the signal and background processes. Additionally, once the selections and the phase space in which to perform the measurement have been finalized, the expected signal and background yields have to be carefully estimated. This often again necessitates simulations, as well as data-driven methods using unbiased control regions. Given the importance of simulations, it is critical to ensure that they are of sufficient quality and quantity in the HL-LHC era; Part IV will discuss efforts towards using DL to this end.

Once we have our observations, and signal and background estimates, the final critical step is to interpret the results in a robust statistical framework. At the LHC, this is typically done using a frequentist, likelihood-based approach. In this chapter, this approach is introduced by way of simple experimental examples.
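To give a flavor of what such an interpretation looks like, the following is a minimal sketch of the simplest case: a single-bin counting experiment with a perfectly known expected background. The function name and the numbers used are hypothetical, chosen purely for illustration, and are not taken from the code accompanying this chapter. Under the background-only hypothesis, the observed count is assumed to be Poisson distributed; the $p$-value is then the probability of observing at least as many events as were seen, which can be converted to a one-sided Gaussian significance $Z$.

```python
from scipy.stats import norm, poisson


def counting_experiment_significance(n_obs: int, b_exp: float) -> tuple[float, float]:
    """Return (p-value, Z) for a single-bin counting experiment.

    Assumes (for illustration only) a background-only hypothesis in which
    the count is Poisson distributed with a perfectly known mean ``b_exp``.
    """
    # P(N >= n_obs | b_exp): the survival function gives P(N > k), so use k = n_obs - 1
    p_value = poisson.sf(n_obs - 1, b_exp)
    # Convert to a one-sided Gaussian significance: Z = Phi^{-1}(1 - p)
    z = norm.isf(p_value)
    return p_value, z


# Hypothetical numbers: 20 events observed on an expected background of 10
p_value, z = counting_experiment_significance(n_obs=20, b_exp=10.0)
print(f"p-value = {p_value:.3g}, significance = {z:.2f} sigma")
```

A realistic analysis would additionally have to account for uncertainties on the background estimate, which is one reason the likelihood-based framework of the following sections is needed.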

The chapter is organized as follows. Section 8.2.1 introduces the concepts of likelihood functions and test statistics, with Section 8.2.2 discussing the framework for hypothesis testing, including $p$-values, significances, and the statistical definition of a “discovery”. Sections 8.2.3 and 8.2.4 then describe frequentist confidence intervals and upper limits, and the important concepts of expected significances and limits, respectively. Finally, asymptotic approximations to simplify these computations are discussed in Section 8.3.

The chapter is based primarily on the highly useful Refs. [58, 301]. The code for all the plots and results in this chapter is available at rkansal47.github.io/stats-for-hep; it makes extensive use of the NumPy [302], SciPy [303], and matplotlib [304] Python libraries.