The path from the raw data collected with hardware, sensors, and triggers to the high-level data, where particles have been identified and particle showers reconstructed, is long and complicated. Large collaborations like ATLAS and CMS work on reconstructing and matching physics objects from the low-level streams of bits. Nowadays, in the era of data, these collaborations give everyone the opportunity to tackle modern particle-physics challenges through the data they share as part of the CERN Open Data initiative.
The aim of this toy project is to learn the basic data-analysis toolkit used at CERN, to learn the structure of ATLAS Open Data (since it quite closely resembles the real data physicists work with), and finally to learn the most common pitfalls data analysts at CERN encounter when conducting state-of-the-art analyses.
We will be focusing on the Higgs → γγ dataset. One of the interesting particle-physics challenges that can be tackled with Open Data is identifying the production mechanism of a particle. Knowing various features of the particles in an event, such as their number, transverse momentum (pT), and angular distribution, the goal is to find out whether these particles were produced from a Higgs boson born in vector boson fusion, gluon-gluon fusion, associated production with a vector boson, or associated production with top quarks.
Image("img/higgs_mechanisms.png", width=600)
(image borrowed from: https://grab-group.ethz.ch/research/higgs-boson.html)
The outline of the project is as follows:
Event selection based on https://arxiv.org/abs/1802.04146
The goal of this chapter is to practice extracting data from the ROOT files provided by ATLAS Open Data. The fully pythonic approach implemented here serves as a proof of concept that the flexibility of the Python ecosystem can be exploited without losing efficiency in data processing.
For the analysis we preserve the directory structure of the Open Data archives. The Monte Carlo datasets carry a label that will be useful in the next chapters, where we are going to build a classifier to differentiate the production mechanisms of the Higgs boson at the LHC.
datasets = {
    "GamGam": {
        "Data": [
            "data_A.GamGam",
            "data_B.GamGam",
            "data_C.GamGam",
            "data_D.GamGam"
        ],
        "MC": [
            ("mc_341081.ttH125_gamgam.GamGam", {"tag": "tt"}),
            ("mc_343981.ggH125_gamgam.GamGam", {"tag": "gg"}),
            ("mc_345041.VBFH125_gamgam.GamGam", {"tag": "VBF"}),
            ("mc_345318.WpH125J_Wincl_gamgam.GamGam", {"tag": "Wp"}),
            ("mc_345319.ZH125J_Zincl_gamgam.GamGam", {"tag": "Z"})
        ]
    }
}
We see that the event features are grouped into categories: photon_*, lep_*, jet_*, and so on. We also separate features into micro and macro features. Macro features are event-wide and can be operated on in a vectorized manner across a batch of events, while micro features depend on the particle content of each event. For instance, photon_n is a macro feature: each event has some number of photons. photon_pt is a micro feature, because each of the n photons in an event has its own pT. At the same time, photon_lead_pt, the pT of the leading photon, is one number per event, so it becomes "macro" again.
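As a minimal illustration of the distinction (the toy numbers below are made up, not taken from the dataset):

import numpy as np

# Toy batch of 3 events with 2 photons each (rectangular for simplicity; the real arrays are jagged).
photon_n = np.array([2, 2, 2])                       # macro feature: one value per event
photon_pt = np.array([[52000., 38000.],
                      [41000., 67000.],
                      [36000., 29000.]])             # micro feature: one value per photon

# Reducing a micro feature over the photons of each event yields a new macro feature.
photon_lead_pt = photon_pt.max(axis=1)               # leading-photon pT per event
print(photon_lead_pt)                                # [52000. 67000. 36000.]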
In the function below we apply cuts on a batch of events based on photon macro and micro features. Notice that there are no explicit Python loops over the events in the batch; that is the power of numpy: all loops happen in the backend, in precompiled C/Fortran modules.
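The function relies on two helpers defined elsewhere in the notebook, dict_apply_mask and atlas_two_cosine, which are not reproduced in this chapter. The sketches below are assumptions about their behaviour, inferred from how they are called: dict_apply_mask applies one boolean event mask to every array in a feature dictionary, and atlas_two_cosine returns the angular term $\cos\Delta\phi + \sinh\eta_1\sinh\eta_2$ entering the diphoton invariant mass $m^2 = 2\,(E_1 E_2 - p_{T1} p_{T2}\,(\cos\Delta\phi + \sinh\eta_1\sinh\eta_2))$.

import numpy as np

def dict_apply_mask(feature_dict, mask):
    # Assumed behaviour: filter every per-event array in the dictionary with the same boolean mask.
    for key in feature_dict:
        feature_dict[key] = feature_dict[key][mask]

def atlas_two_cosine(events, prefix1, prefix2):
    # Assumed behaviour: cos(dphi) + sinh(eta1)*sinh(eta2) for the two photons named by the byte prefixes.
    dphi = events[prefix1 + b"_phi"] - events[prefix2 + b"_phi"]
    return np.cos(dphi) + np.sinh(events[prefix1 + b"_eta"]) * np.sinh(events[prefix2 + b"_eta"])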
def photon_filter(events):
    """Select diphoton events and build per-event (macro) features of the two leading photons."""
    macro_events = {}
    micro_events = {}

    # Keep only events with at least two photons.
    macro_events[b"photon_n"] = events[b"photon_n"]
    n_threshold = macro_events[b"photon_n"] >= 2
    macro_mask = n_threshold

    # Per-photon (micro) features of the surviving events.
    photon_features = [b"pt", b"eta", b"phi", b"E",
                       b"isTightID", b"trigMatched", b"ptcone30", b"etcone20"]
    for name in photon_features:
        micro_events[b"photon_" + name] = events[b"photon_" + name][n_threshold]

    # Indices of the photons sorted by descending pT within each event.
    pts = micro_events[b"photon_pt"].argsort(ascending=False)
    row_indices = np.arange(pts.shape[0])
    lead_pts = pts[:, 0]
    sublead_pts = pts[:, 1]

    # Promote the leading and subleading photon properties to macro (per-event) features.
    macro_events[b"photon_n"] = macro_events[b"photon_n"][n_threshold]
    for name in photon_features:
        macro_events[b"photon_lead_" + name] = micro_events[b"photon_" + name][row_indices, lead_pts]
        macro_events[b"photon_sublead_" + name] = micro_events[b"photon_" + name][row_indices, sublead_pts]
    # Kinematic, identification, trigger-matching, isolation, and acceptance cuts on the two leading photons.
    # Isolation is relative: the cone energy divided by the photon pT.
    macro_filter = (
        (macro_events[b"photon_lead_pt"] > 35000)
        & (macro_events[b"photon_sublead_pt"] > 25000)
        & (macro_events[b"photon_lead_isTightID"])
        & (macro_events[b"photon_sublead_isTightID"])
        & (macro_events[b"photon_lead_trigMatched"])
        & (macro_events[b"photon_sublead_trigMatched"])
        & (macro_events[b"photon_lead_ptcone30"] / macro_events[b"photon_lead_pt"] < 0.065)
        & (macro_events[b"photon_lead_etcone20"] / macro_events[b"photon_lead_pt"] < 0.065)
        & (macro_events[b"photon_sublead_ptcone30"] / macro_events[b"photon_sublead_pt"] < 0.065)
        & (macro_events[b"photon_sublead_etcone20"] / macro_events[b"photon_sublead_pt"] < 0.065)
        # Photons must be in the barrel (|eta| < 1.37) or endcap (1.52 < |eta| < 2.37) acceptance.
        & (
            (np.absolute(macro_events[b"photon_lead_eta"]) < 1.37)
            | (
                (np.absolute(macro_events[b"photon_lead_eta"]) > 1.52)
                & (np.absolute(macro_events[b"photon_lead_eta"]) < 2.37)
            )
        )
        & (
            (np.absolute(macro_events[b"photon_sublead_eta"]) < 1.37)
            | (
                (np.absolute(macro_events[b"photon_sublead_eta"]) > 1.52)
                & (np.absolute(macro_events[b"photon_sublead_eta"]) < 2.37)
            )
        )
    )
    # Apply the macro cuts and propagate them into the global event mask.
    dict_apply_mask(macro_events, macro_filter)
    macro_mask[macro_mask] = macro_filter

    # Diphoton invariant mass (in MeV): m^2 = 2*(E1*E2 - pT1*pT2*(cos(dphi) + sinh(eta1)*sinh(eta2))).
    macro_events[b"h_mass"] = np.sqrt(2.*(
        macro_events[b"photon_lead_E"] * macro_events[b"photon_sublead_E"]
        - macro_events[b"photon_lead_pt"] * macro_events[b"photon_sublead_pt"]
        * atlas_two_cosine(macro_events, b"photon_lead", b"photon_sublead")
    ))

    # Mass-dependent pT cuts and the 105-160 GeV mass window.
    mass_cutoff = (macro_events[b"photon_lead_pt"]/macro_events[b"h_mass"] > 0.35) \
        & (macro_events[b"photon_sublead_pt"]/macro_events[b"h_mass"] > 0.25) \
        & (macro_events[b"h_mass"] >= 105000.) \
        & (macro_events[b"h_mass"] <= 160000.)
    dict_apply_mask(macro_events, mass_cutoff)
    macro_mask[macro_mask] = mass_cutoff

    return macro_events, macro_mask

#photon_filter(test_events)
In addition to the photon features relevant for computing the Higgs boson mass, we also compute event weights for the MC samples. Each selected MC event is weighted by $w = \mathrm{mcWeight} \times \prod_k \mathrm{SF}_k \times \sigma / \mathrm{SumWeights}$, where the $\mathrm{SF}_k$ are the per-event scale factors, $\sigma$ is the sample cross section, and $\mathrm{SumWeights}$ is the total sum of generator weights. This normalizes the weights to a luminosity of $1~pb^{-1}$; we rescale them later to the proper ATLAS Open Data luminosity.
def process_event_batch(events, bin_edges, weighted=False):
    """Select diphoton events in a batch and histogram the diphoton mass, optionally with MC weights."""
    filtered_photons, photon_macro_mask = photon_filter(events)
    h_masses = filtered_photons[b"h_mass"]/1000  # MeV -> GeV
    if weighted:
        # Per-sample constants (identical for every event in the file).
        total_weights = events[b"SumWeights"][0]
        x_section = events[b"XSection"][0]
        # Normalize MC to 1 pb^-1: (mcWeight * scale factors) / SumWeights * cross section.
        weights = ( events[b"mcWeight"]
                    * events[b'scaleFactor_PILEUP']
                    * events[b'scaleFactor_ELE']
                    * events[b'scaleFactor_MUON']
                    * events[b'scaleFactor_PHOTON']
                    * events[b'scaleFactor_TAU']
                    * events[b'scaleFactor_BTAG']
                    * events[b'scaleFactor_LepTRIGGER']
                    * events[b'scaleFactor_PhotonTRIGGER']
                  )[photon_macro_mask]/total_weights*x_section
    else:
        weights = None
    hist = np.histogram(h_masses, bin_edges, weights=weights, density=False)[0]
    return hist

#process_event_batch(test_events, bin_edges)
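The per-file driver histogram_per_file used in the next cells is likewise defined elsewhere in the notebook. Below is a minimal sketch of what it could look like, assuming uproot 3-style iteration over the "mini" tree of the Open Data ntuples; the base URL and batch size are illustrative assumptions, not taken from the original.

import numpy as np
import uproot  # uproot 3.x, which yields dicts keyed by byte strings, as assumed above

BASE_URL = "https://example.cern.ch/atlas-opendata/GamGam"  # placeholder, not the real archive path

def histogram_per_file(datasets, bin_edges, categories, batch_size=100000):
    # Return {sample_name: histogram} for the requested categories ("Data" and/or "MC").
    hists = {}
    for stream, groups in datasets.items():
        for category in categories:
            for entry in groups[category]:
                name = entry[0] if category == "MC" else entry
                url = "{}/{}/{}.root".format(BASE_URL, category, name)
                hist = np.zeros(len(bin_edges) - 1)
                # Iterate over the "mini" tree in batches; each batch is a dict of arrays.
                for events in uproot.iterate(url, "mini", entrysteps=batch_size):
                    hist += process_event_batch(events, bin_edges, weighted=(category == "MC"))
                hists[name] = hist
    return hists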
We are going to fill a histogram between 105 GeV and 160 GeV with a bin size of 1 GeV.
bin_edges = np.linspace(105., 160., 56, dtype=np.float64)  # 56 edges -> 55 bins of 1 GeV
mc_hists = histogram_per_file(datasets, bin_edges, ["MC"])
data_hists = histogram_per_file(datasets, bin_edges, ["Data"])
A nice observation: 100 000 events can be processed in 2 seconds, and ~4 million events take 2 minutes to process. We could improve on this by increasing the batch size, because the major overhead comes from data transport through XRootD. Moreover, multithreading can be applied, since the analyses of individual batches are independent; see the sketch below.
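As an illustration of that last point, here is a sketch of a thread pool over files; the worker count and the helper name are illustrative, and it reuses the per-file loop assumed in the histogram_per_file sketch above.

from concurrent.futures import ThreadPoolExecutor

def histograms_threaded(file_urls, bin_edges, weighted=False, max_workers=4):
    # Histogram several files concurrently; threads suffice because the work is dominated by I/O.
    def one_file(url):
        hist = np.zeros(len(bin_edges) - 1)
        for events in uproot.iterate(url, "mini", entrysteps=100000):
            hist += process_event_batch(events, bin_edges, weighted=weighted)
        return hist
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one_file, file_urls))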
Below we show the per-file histograms of the diphoton mass, normalized to a luminosity of $1~pb^{-1}$. We then add up all the data histograms to plot them later as data points, and add up the MC histograms to overlay them on the same figure. Before plotting we also show the aggregated histograms, to be rescaled to the proper luminosity.
mc_hists
data_hists
data_hist = sum(data_hists.values())
mc_hist = sum(mc_hists.values())
We also define the integrated luminosity as $10.6~fb^{-1}$, expressed in $pb^{-1}$ to match the weight normalization.
L = 1000*10.6  # 10.6 fb^-1 in pb^-1, matching the 1 pb^-1 weight normalization
mc_hist
data_hist
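The plotting cell itself is not reproduced here; below is a minimal matplotlib sketch of the comparison, assuming the MC histogram is scaled by the luminosity L and the data is drawn as points with $\sqrt{N}$ error bars.

import numpy as np
import matplotlib.pyplot as plt

bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

fig, ax = plt.subplots(figsize=(8, 5))
# MC signal rescaled from 1 pb^-1 to the full luminosity L.
ax.hist(bin_centers, bins=bin_edges, weights=L * mc_hist,
        histtype="step", color="red", label="Signal MC (all production modes)")
# Data as points with sqrt(N) statistical uncertainties.
ax.errorbar(bin_centers, data_hist, yerr=np.sqrt(data_hist),
            fmt="ko", markersize=3, label="Data")
ax.set_xlabel(r"$m_{\gamma\gamma}$ [GeV]")
ax.set_ylabel("Events / bin")
ax.legend()
plt.show()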
Finally we arrive at the plot of the Higgs boson candidate mass computed from the diphoton events, together with the data points. It turns out that most of the data statistics is background, while our MC contains only signal datasets; that is why the data does not match the MC. Since the goal of this notebook was not the rediscovery of the Higgs boson but an acquaintance with the data, its features, and the relevant Python libraries, we wrap up here and move on to the next chapter. There we will make use of the workflow learned in this chapter and set up the feature-processing pipeline for the production-mechanism classification problem.