FYS5555 2020

Final Project: ATLAS Open Data

«Classification of Higgs production mechanism based on Hγγ dataset»

The path from the raw data collected with hardware, sensors and triggers to the high-level data, where particles have been identified and particle showers reconstructed, is long and complicated. Big collaborations like ATLAS and CMS work on reconstructing and matching physics objects from the low-level streams of bits. Nowadays, in the era of data, these collaborations give everyone an opportunity to tackle modern particle physics challenges through the data they share as part of the CERN Open Data initiative.

The aim of this toy project is to learn the basic data analysis toolkit used at CERN, to learn the structure of ATLAS Open Data (since it quite closely resembles the real data physicists work with), and finally to learn the most common pitfalls CERN data analysts encounter when conducting state-of-the-art analyses.

We will be focusing on the Higgs → γγ dataset. One of the interesting challenges in particle physics that can be tackled with Open Data is to identify the production mechanism of the particle. Knowing various features of the final-state particles, such as their number, pt and angular distribution, the goal is to find out whether the Higgs boson was produced through vector boson fusion, gluon-gluon fusion, associated production with a vector boson, or in association with top quarks.

In [2]:
Image("img/higgs_mechanisms.png", width=600)
Out[2]:

(image borrowed from: https://grab-group.ethz.ch/research/higgs-boson.html)

The outline of the project is as follows:

  1. Data investigation. Comparing MC and data samples using the Higgs di-photon peak analysis as an example.
  2. Exploratory data analysis. Focusing on the MC data, which is labelled, we take a brief look at the features to include in or exclude from the feature set used to discriminate the production mechanisms.
  3. Feature processing pipeline. Here we describe how we extract, transform and save the features used later for classification.
  4. Modelling. The model-building playground. We encounter common pitfalls and review possible ways to address them from the literature.

Data investigation. Higgs di-photon peak

Event selection based on https://arxiv.org/abs/1802.04146

The goal of this chapter is to practice extracting data from the ROOT files provided by ATLAS Open Data. The fully pythonic approach implemented here serves as a proof of concept that the flexibility of the Python library ecosystem can be exploited without losing data-processing efficiency.

For the analysis we preserve the directory structure of the Open Data archives. The Monte Carlo datasets carry a label that will be useful in the next chapters, where we build a classifier for distinguishing the production mechanisms of the Higgs boson at the LHC.

In [4]:
datasets = {
    "GamGam": {
        "Data": [
            "data_A.GamGam",
            "data_B.GamGam",
            "data_C.GamGam",
            "data_D.GamGam"
        ],
        "MC": [
            ("mc_341081.ttH125_gamgam.GamGam", {"tag": "tt"}),
            ("mc_343981.ggH125_gamgam.GamGam", {"tag": "gg"}),
            ("mc_345041.VBFH125_gamgam.GamGam", {"tag": "VBF"}),
            ("mc_345318.WpH125J_Wincl_gamgam.GamGam", {"tag": "Wp"}),
            ("mc_345319.ZH125J_Zincl_gamgam.GamGam", {"tag": "Z"})

        ]
    }
}
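
The helper functions used later (histogram_per_file in this chapter, events_per_file in the next) read these files in batches. As a minimal sketch of the underlying iteration pattern, assuming uproot 3 (which yields the bytes branch keys used throughout this notebook), the ATLAS Open Data tree name "mini", and an illustrative file path:

import uproot

# illustrative path; the real paths are assembled from the datasets dict above
path = "root://eosuser.cern.ch//eos/user/a/ananiev/data/GamGam/Data/data_A.GamGam.root"
tree = uproot.open(path)["mini"]
for batch in tree.iterate(entrysteps=100000):
    # `batch` is a dict of (jagged) arrays keyed by bytes branch names,
    # e.g. batch[b"photon_pt"]; each batch is processed independently
    pass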

The event features are grouped into the following categories:

  • generic features: MET, triggers, weight, ...
  • particle-related features: photon_*, lep_*, jet_*, ...

We also separate features into micro and macro features. Macro features are event-wide and can be operated on in a vectorized manner across a batch of events, while micro features depend on the particle content of each event. For instance, photon_n is a macro feature (each event has a single photon count), whereas photon_pt is a micro feature, because each of the n photons in the event has its own pt. At the same time, photon_lead_pt, the pt of the leading photon, is again a "macro" feature.
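
As a toy illustration of this reduction from micro to macro features (a plain 2D numpy array stands in for the jagged photon_pt; in the real data each event can have a different number of photons):

import numpy as np

# 3 toy events with 2 photons each; photon_pt is a micro feature (one value per photon)
photon_pt = np.array([[48., 61.], [35., 27.], [90., 40.]]) * 1000.  # MeV
photon_n = np.array([2, 2, 2])                                      # macro feature

# reduce the micro feature to a macro one: pt of the leading photon in each event
order = np.argsort(-photon_pt, axis=1)         # descending pt order within each event
rows = np.arange(photon_pt.shape[0])
photon_lead_pt = photon_pt[rows, order[:, 0]]  # one value per event, i.e. "macro" again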

In the function below we apply cuts to an event batch based on photon macro and micro features. Notice that there are no explicit loops over events in the batch; that is the power of numpy: all loops happen in the backend, in precompiled C/Fortran modules.

In [9]:
def photon_filter(events):
    macro_events = {}
    micro_events = {}
    
    # require at least two photons per event
    macro_events[b"photon_n"] = events[b"photon_n"]
    n_threshold = macro_events[b"photon_n"] >= 2
    macro_mask = n_threshold
    
    micro_events[b"photon_pt"] = events[b"photon_pt"][n_threshold]
    micro_events[b"photon_eta"] = events[b"photon_eta"][n_threshold]
    micro_events[b"photon_phi"] = events[b"photon_phi"][n_threshold]
    micro_events[b"photon_E"] = events[b"photon_E"][n_threshold]
    micro_events[b"photon_isTightID"] = events[b"photon_isTightID"][n_threshold]
    micro_events[b"photon_trigMatched"] = events[b"photon_trigMatched"][n_threshold]
    micro_events[b"photon_ptcone30"] = events[b"photon_ptcone30"][n_threshold]
    micro_events[b"photon_etcone20"] = events[b"photon_etcone20"][n_threshold]
    
    
    # order photons within each event by descending pt; pick leading/subleading indices
    pts = micro_events[b"photon_pt"].argsort(ascending=False)
    row_indices = np.arange(pts.shape[0])
    lead_pts = pts[:, 0]
    sublead_pts = pts[:, 1]
    
    
    macro_events[b"photon_n"] = macro_events[b"photon_n"][n_threshold]
    macro_events[b"photon_lead_pt"] = micro_events[b"photon_pt"][row_indices, lead_pts]
    macro_events[b"photon_lead_eta"] = micro_events[b"photon_eta"][row_indices, lead_pts]
    macro_events[b"photon_lead_phi"] = micro_events[b"photon_phi"][row_indices, lead_pts]
    macro_events[b"photon_lead_E"] = micro_events[b"photon_E"][row_indices, lead_pts]
    macro_events[b"photon_lead_isTightID"] = micro_events[b"photon_isTightID"][row_indices, lead_pts]
    macro_events[b"photon_lead_trigMatched"] = micro_events[b"photon_trigMatched"][row_indices, lead_pts]
    macro_events[b"photon_lead_ptcone30"] = micro_events[b"photon_ptcone30"][row_indices, lead_pts]
    macro_events[b"photon_lead_etcone20"] = micro_events[b"photon_etcone20"][row_indices, lead_pts]
    macro_events[b"photon_sublead_pt"] = micro_events[b"photon_pt"][row_indices, sublead_pts]
    macro_events[b"photon_sublead_eta"] = micro_events[b"photon_eta"][row_indices, sublead_pts]
    macro_events[b"photon_sublead_phi"] = micro_events[b"photon_phi"][row_indices, sublead_pts]
    macro_events[b"photon_sublead_E"] = micro_events[b"photon_E"][row_indices, sublead_pts]
    macro_events[b"photon_sublead_isTightID"] = micro_events[b"photon_isTightID"][row_indices, sublead_pts]
    macro_events[b"photon_sublead_trigMatched"] = micro_events[b"photon_trigMatched"][row_indices, sublead_pts]
    macro_events[b"photon_sublead_ptcone30"] = micro_events[b"photon_ptcone30"][row_indices, sublead_pts]
    macro_events[b"photon_sublead_etcone20"] = micro_events[b"photon_etcone20"][row_indices, sublead_pts]
    
    # kinematic, tight-ID, trigger-match, isolation and eta (crack-region) cuts
    macro_filter = (  (macro_events[b"photon_lead_pt"] > 35000)
                    &
                      (macro_events[b"photon_sublead_pt"] > 25000)
                    &
                      (macro_events[b"photon_lead_isTightID"])
                    &
                      (macro_events[b"photon_sublead_isTightID"])
                    &
                      (macro_events[b"photon_lead_trigMatched"])
                    &
                      (macro_events[b"photon_sublead_trigMatched"])
                    & 
                      (macro_events[b"photon_lead_ptcone30"] < 0.065)
                    & 
                      (macro_events[b"photon_lead_etcone20"] < 0.065)
                    & 
                      (macro_events[b"photon_sublead_ptcone30"] < 0.065)
                    & 
                      (macro_events[b"photon_sublead_etcone20"] < 0.065)
                    & 
                      (  (np.absolute(macro_events[b"photon_lead_eta"]) < 1.37) 
                       | 
                         (  (np.absolute(macro_events[b"photon_lead_eta"]) > 1.52)
                          & (np.absolute(macro_events[b"photon_lead_eta"]) < 2.37)
                         )
                      )
                    & (  (np.absolute(macro_events[b"photon_sublead_eta"]) < 1.37) 
                       | 
                         (  (np.absolute(macro_events[b"photon_sublead_eta"]) > 1.52)
                          & (np.absolute(macro_events[b"photon_sublead_eta"]) < 2.37)
                         )
                      )
                   )
    
    # apply the cuts to all collected macro features and update the running event mask
    dict_apply_mask(macro_events, macro_filter)
    macro_mask[macro_mask] = macro_filter

    macro_events[b"h_mass"] = np.sqrt(2.*(
                                            macro_events[b"photon_lead_E"]*macro_events[b"photon_sublead_E"] \
                                            - macro_events[b"photon_lead_pt"]*macro_events[b"photon_sublead_pt"]*atlas_two_cosine(macro_events, b"photon_lead", b"photon_sublead")
                                        )
                                     )
    
    # pt/m requirements and the 105-160 GeV mass window
    mass_cutoff =   (macro_events[b"photon_lead_pt"]/macro_events[b"h_mass"] > 0.35) \
                  & (macro_events[b"photon_sublead_pt"]/macro_events[b"h_mass"] > 0.25) \
                  & (macro_events[b"h_mass"] >= 105000.) \
                  & (macro_events[b"h_mass"] <= 160000)
    
    dict_apply_mask(macro_events, mass_cutoff)
    macro_mask[macro_mask] = mass_cutoff
    
    return macro_events, macro_mask
#photon_filter(test_events)
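
photon_filter relies on two helpers defined earlier in the notebook and not shown in this section: atlas_two_cosine, which computes the angular term between the two photons from their eta and phi, and dict_apply_mask, which filters every array in a feature dict with a boolean mask. A minimal sketch of the latter, assuming all values are arrays of equal length:

def dict_apply_mask(events, mask):
    # apply the boolean mask in place to every per-event array stored in the dict
    for key in events:
        events[key] = events[key][mask]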

In addition to the photon features relevant for computing the Higgs boson mass, we also compute event weights. We normalize the weights to a luminosity of $1~pb^{-1}$ and rescale them later to the proper ATLAS Open Data luminosity.

In [10]:
def process_event_batch(events, bin_edges, weighted=False):
    filtered_photons, photon_macro_mask = photon_filter(events)
    h_masses = filtered_photons[b"h_mass"]/1000
    if weighted:
        # normalize MC weights to 1 pb^-1: mcWeight times scale factors, divided by SumWeights and scaled by XSection
        total_weights = events[b"SumWeights"][0]
        x_section = events[b"XSection"][0]
        weights = (  events[b"mcWeight"]
                   * events[b'scaleFactor_PILEUP'] 
                   * events[b'scaleFactor_ELE'] 
                   * events[b'scaleFactor_MUON'] 
                   * events[b'scaleFactor_PHOTON'] 
                   * events[b'scaleFactor_TAU'] 
                   * events[b'scaleFactor_BTAG'] 
                   * events[b'scaleFactor_LepTRIGGER'] 
                   * events[b'scaleFactor_PhotonTRIGGER']
                  )[photon_macro_mask]/total_weights*x_section
    else:
        weights = None
    hist = np.histogram(h_masses, bin_edges, weights=weights, density=False)[0]
    return hist
#process_event_batch(test_events, bin_edges)

We fill a histogram between 105 GeV and 160 GeV; np.linspace with 55 edges gives 54 bins of roughly 1 GeV each.

In [11]:
bin_edges = np.linspace(105., 160., 55, dtype=np.float64)
In [13]:
mc_hists = histogram_per_file(datasets, bin_edges, ["MC"])
Process:  GamGam
Type:  Data
Type:  MC
File:  mc_341081.ttH125_gamgam.GamGam
Total Num Events:  576491
Processed:  100000 ; Time per batch:  3.273832584032789
Processed:  200000 ; Time per batch:  2.0901889570523053
Processed:  300000 ; Time per batch:  2.019922111881897
Processed:  400000 ; Time per batch:  2.5594259831123054
Processed:  500000 ; Time per batch:  8.562512199860066
Processed:  600000 ; Time per batch:  1.3569668761920184
Total time:  19.86416905093938 ; Median time per batch:  2.3248074700823054
File:  mc_343981.ggH125_gamgam.GamGam
Total Num Events:  1054711
Processed:  100000 ; Time per batch:  3.693176604108885
Processed:  200000 ; Time per batch:  2.2428605949971825
Processed:  300000 ; Time per batch:  1.3741464968770742
Processed:  400000 ; Time per batch:  2.1243397961370647
Processed:  500000 ; Time per batch:  2.37794060097076
Processed:  600000 ; Time per batch:  1.3524941429495811
Processed:  700000 ; Time per batch:  2.2173393280245364
Processed:  800000 ; Time per batch:  2.6964502590708435
Processed:  900000 ; Time per batch:  1.4090353539213538
Processed:  1000000 ; Time per batch:  1.8762797510717064
Processed:  1100000 ; Time per batch:  0.9314408109057695
Total time:  22.29723080713302 ; Median time per batch:  2.1243397961370647
File:  mc_345041.VBFH125_gamgam.GamGam
Total Num Events:  497468
Processed:  100000 ; Time per batch:  5.631897657178342
Processed:  200000 ; Time per batch:  2.0497449268586934
Processed:  300000 ; Time per batch:  2.2447489160113037
Processed:  400000 ; Time per batch:  1.3242727490141988
Processed:  500000 ; Time per batch:  1.511085090925917
Total time:  12.76316564809531 ; Median time per batch:  2.0497449268586934
File:  mc_345318.WpH125J_Wincl_gamgam.GamGam
Total Num Events:  113765
Processed:  100000 ; Time per batch:  1.8206009629648179
Processed:  200000 ; Time per batch:  0.4103519378695637
Total time:  2.2314883389044553 ; Median time per batch:  1.1154764504171908
File:  mc_345319.ZH125J_Zincl_gamgam.GamGam
Total Num Events:  230900
Processed:  100000 ; Time per batch:  2.369850924005732
Processed:  200000 ; Time per batch:  1.7537660240195692
Processed:  300000 ; Time per batch:  0.593960426049307
Total time:  4.718156240181997 ; Median time per batch:  1.7537660240195692
In [14]:
data_hists = histogram_per_file(datasets, bin_edges, ["Data"])
Process:  GamGam
Type:  Data
File:  data_A.GamGam
Total Num Events:  430344
Processed:  100000 ; Time per batch:  2.957507056882605
Processed:  200000 ; Time per batch:  2.4960478220600635
Processed:  300000 ; Time per batch:  2.146462333854288
Processed:  400000 ; Time per batch:  1.2710055720526725
Processed:  500000 ; Time per batch:  0.6249831181485206
Total time:  9.49666281696409 ; Median time per batch:  2.146462333854288
File:  data_B.GamGam
Total Num Events:  1528717
Processed:  100000 ; Time per batch:  4.2246646550484
Processed:  200000 ; Time per batch:  3.630030795931816
Processed:  300000 ; Time per batch:  1.3716805521398783
Processed:  400000 ; Time per batch:  2.122378665022552
Processed:  500000 ; Time per batch:  2.7519458308815956
Processed:  600000 ; Time per batch:  1.6718853670172393
Processed:  700000 ; Time per batch:  2.113810244947672
Processed:  800000 ; Time per batch:  2.370707865105942
Processed:  900000 ; Time per batch:  1.3582110649440438
Processed:  1000000 ; Time per batch:  3.4243466190528125
Processed:  1100000 ; Time per batch:  3.521174421068281
Processed:  1200000 ; Time per batch:  1.4221787438727915
Processed:  1300000 ; Time per batch:  2.541199133032933
Processed:  1400000 ; Time per batch:  2.4681820950936526
Processed:  1500000 ; Time per batch:  1.3522421319503337
Processed:  1600000 ; Time per batch:  0.6481136940419674
Total time:  36.99424342508428 ; Median time per batch:  2.246543265064247
File:  data_C.GamGam
Total Num Events:  2237187
Processed:  100000 ; Time per batch:  3.926153526874259
Processed:  200000 ; Time per batch:  7.073996527120471
Processed:  300000 ; Time per batch:  1.9510420409496874
Processed:  400000 ; Time per batch:  5.319479317869991
Processed:  500000 ; Time per batch:  6.436766006052494
Processed:  600000 ; Time per batch:  1.9694347770418972
Processed:  700000 ; Time per batch:  3.4564639790914953
Processed:  800000 ; Time per batch:  2.296430556802079
Processed:  900000 ; Time per batch:  1.3495494150556624
Processed:  1000000 ; Time per batch:  2.211353654973209
Processed:  1100000 ; Time per batch:  3.236614224035293
Processed:  1200000 ; Time per batch:  1.3631930989213288
Processed:  1300000 ; Time per batch:  2.1597060181666166
Processed:  1400000 ; Time per batch:  2.2664976730011404
Processed:  1500000 ; Time per batch:  1.354186096927151
Processed:  1600000 ; Time per batch:  2.4892491861246526
Processed:  1700000 ; Time per batch:  5.082142660859972
Processed:  1800000 ; Time per batch:  1.5736373830586672
Processed:  1900000 ; Time per batch:  2.9936846329364926
Processed:  2000000 ; Time per batch:  2.9228627749253064
Processed:  2100000 ; Time per batch:  1.41682945122011
Processed:  2200000 ; Time per batch:  1.5762466418091208
Processed:  2300000 ; Time per batch:  0.6803281081374735
Total time:  65.107905466808 ; Median time per batch:  2.2664976730011404
File:  data_D.GamGam
Total Num Events:  3602176
Processed:  100000 ; Time per batch:  6.8443712829612195
Processed:  200000 ; Time per batch:  9.260830905055627
Processed:  300000 ; Time per batch:  19.006831076927483
Processed:  400000 ; Time per batch:  1.3225739758927375
Processed:  500000 ; Time per batch:  10.343360111117363
Processed:  600000 ; Time per batch:  5.5268969759345055
Processed:  700000 ; Time per batch:  1.5096590199973434
Processed:  800000 ; Time per batch:  4.239757927134633
Processed:  900000 ; Time per batch:  2.3611507168971
Processed:  1000000 ; Time per batch:  2.443563654087484
Processed:  1100000 ; Time per batch:  1.3094460659194738
Processed:  1200000 ; Time per batch:  2.19370813597925
Processed:  1300000 ; Time per batch:  2.196698415093124
Processed:  1400000 ; Time per batch:  1.329906899947673
Processed:  1500000 ; Time per batch:  2.359392515849322
Processed:  1600000 ; Time per batch:  2.2051101431716233
Processed:  1700000 ; Time per batch:  1.380728310905397
Processed:  1800000 ; Time per batch:  2.1504601801279932
Processed:  1900000 ; Time per batch:  2.6695830868557096
Processed:  2000000 ; Time per batch:  4.728721522958949
Processed:  2100000 ; Time per batch:  1.4027611480560154
Processed:  2200000 ; Time per batch:  2.752245787065476
Processed:  2300000 ; Time per batch:  3.232537433039397
Processed:  2400000 ; Time per batch:  1.3538068979978561
Processed:  2500000 ; Time per batch:  2.475524908863008
Processed:  2600000 ; Time per batch:  2.864899810170755
Processed:  2700000 ; Time per batch:  1.7699140037875623
Processed:  2800000 ; Time per batch:  2.9390002770815045
Processed:  2900000 ; Time per batch:  2.431639362126589
Processed:  3000000 ; Time per batch:  4.5829274950083345
Processed:  3100000 ; Time per batch:  1.383141626836732
Processed:  3200000 ; Time per batch:  3.1568072540685534
Processed:  3300000 ; Time per batch:  3.518081624060869
Processed:  3400000 ; Time per batch:  1.3794054239988327
Processed:  3500000 ; Time per batch:  2.3002057268749923
Processed:  3600000 ; Time per batch:  1.5190929011441767
Processed:  3700000 ; Time per batch:  0.17736856290139258
Total time:  124.6264697688166 ; Median time per batch:  2.3611507168971
Type:  MC

A nice observation: 100 000 events are processed in about 2 seconds, and ~4 million events take about 2 minutes. We could improve on this by increasing the batch size, because the major overhead comes from data transport through xrootd. Moreover, multithreading can be applied, since the analyses of the batches are independent, as sketched below.
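
A minimal sketch of such a multithreaded variant, assuming the batches of one file are already available as an iterable of event dicts (the batch source itself is hypothetical here); threads pay off because each batch spends most of its time waiting on xrootd I/O:

from concurrent.futures import ThreadPoolExecutor

def histogram_file_threaded(batches, bin_edges, n_workers=4):
    # compute the per-batch histograms in parallel threads and sum them up
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        hists = pool.map(lambda ev: process_event_batch(ev, bin_edges), batches)
    return np.sum(list(hists), axis=0)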

Below we show the per-file histograms of the di-photon invariant mass, normalized to a luminosity of $1~pb^{-1}$. We then add up all the data histograms, to plot them later as data points, and add up the MC histograms to draw them on the same figure. Before plotting, we print the aggregated histograms and define the luminosity used to rescale the MC.

In [15]:
mc_hists
Out[15]:
{'GamGam:MC:mc_341081.ttH125_gamgam.GamGam': array([ 5.21269920e-12,  6.56134782e-12,  1.01628289e-11,  8.37056639e-12,
         2.35711225e-11,  6.29098509e-12,  0.00000000e+00,  2.31103717e-11,
         4.06991394e-11,  5.46314057e-11,  3.66656636e-11,  1.73526023e-10,
         4.69358986e-10,  9.49041068e-10,  2.25394389e-09,  4.67781192e-09,
         9.46759715e-09,  1.61016687e-08,  2.49771261e-08,  3.06442187e-08,
         2.77014545e-08,  1.99048618e-08,  1.16149170e-08,  6.14027229e-09,
         3.18150128e-09,  1.42953915e-09,  7.46313233e-10,  3.78136633e-10,
         1.66460623e-10,  3.96855881e-11,  5.87530025e-11,  3.27577965e-11,
         2.69526623e-11,  1.51612056e-11,  1.60120806e-11,  3.51363383e-12,
         0.00000000e+00,  8.84092799e-12, -1.67155179e-12,  7.27951033e-12,
         6.45172804e-12,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        -6.55653309e-12,  1.42037493e-11,  2.72493139e-12,  0.00000000e+00,
         0.00000000e+00,  6.51745324e-12,  4.20641300e-12, -1.07860387e-11,
         0.00000000e+00,  0.00000000e+00]),
 'GamGam:MC:mc_343981.ggH125_gamgam.GamGam': array([6.44175771e-08, 8.43441406e-08, 3.16345467e-07, 1.13340274e-07,
        9.17407803e-08, 2.33867297e-07, 2.73550960e-07, 7.58206234e-07,
        8.52894900e-07, 1.22730124e-06, 2.85892479e-06, 7.68758855e-06,
        1.62533983e-05, 3.67395069e-05, 8.99566853e-05, 1.85554984e-04,
        3.61419154e-04, 6.18977680e-04, 9.02980959e-04, 1.08311531e-03,
        9.76468509e-04, 7.33089400e-04, 4.36087954e-04, 2.37699540e-04,
        1.23833714e-04, 5.80296037e-05, 2.44473922e-05, 1.11108238e-05,
        4.82084579e-06, 1.68327824e-06, 8.42963345e-07, 6.28991984e-07,
        0.00000000e+00, 2.30531441e-07, 2.64175469e-07, 1.52213033e-07,
        4.56930138e-08, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        6.47269189e-08, 6.58328645e-08, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 'GamGam:MC:mc_345041.VBFH125_gamgam.GamGam': array([1.64019540e-08, 0.00000000e+00, 1.09754712e-08, 8.62806360e-09,
        4.91802465e-08, 1.84600060e-08, 2.51826080e-08, 6.23408916e-08,
        4.12726084e-08, 1.33216428e-07, 3.27710787e-07, 7.24731493e-07,
        1.67086466e-06, 3.23582941e-06, 8.55316614e-06, 1.69941925e-05,
        3.22200385e-05, 5.43436199e-05, 8.04310257e-05, 9.59503413e-05,
        8.73195568e-05, 6.33732343e-05, 3.87439941e-05, 2.10641738e-05,
        1.09114044e-05, 5.47544187e-06, 2.12944724e-06, 1.13987335e-06,
        4.87263605e-07, 2.60064553e-07, 1.11256668e-07, 6.50325092e-08,
        4.92873369e-08, 5.87169779e-08, 1.80152711e-08, 2.08092388e-08,
        0.00000000e+00, 3.72310751e-08, 1.80807547e-08, 6.50470611e-09,
        0.00000000e+00, 8.47649062e-09, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 9.92440619e-09,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 'GamGam:MC:mc_345318.WpH125J_Wincl_gamgam.GamGam': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         4.51115811e-09,  5.62805402e-09,  0.00000000e+00,  0.00000000e+00,
         1.54036375e-08,  3.60623540e-08,  5.85909596e-08,  1.23973379e-07,
         3.41282739e-07,  7.90439572e-07,  1.53157691e-06,  3.14664740e-06,
         5.89911008e-06,  9.59106637e-06,  1.47327321e-05,  1.72625519e-05,
         1.62735332e-05,  1.14802679e-05,  7.31041655e-06,  4.23600977e-06,
         1.98164162e-06,  1.09760003e-06,  4.60255251e-07,  1.52416760e-07,
         7.90914783e-08,  4.93018888e-08,  3.56885721e-08,  4.51764208e-08,
         9.96078597e-09,  1.84081728e-08,  0.00000000e+00,  7.33416528e-09,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00, -1.01235855e-08,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  1.01645128e-08,
         1.37806637e-08,  0.00000000e+00]),
 'GamGam:MC:mc_345319.ZH125J_Zincl_gamgam.GamGam': array([3.52667451e-09, 0.00000000e+00, 1.29050015e-09, 8.66052297e-09,
        5.04266051e-09, 4.87604712e-09, 4.91950125e-09, 1.26775861e-08,
        1.87758005e-08, 1.15247039e-08, 8.57251390e-08, 9.89075417e-08,
        3.13821964e-07, 6.45240952e-07, 1.40888579e-06, 3.04210727e-06,
        5.49730993e-06, 9.46533669e-06, 1.38745916e-05, 1.60773652e-05,
        1.49435259e-05, 1.07596807e-05, 6.78465858e-06, 3.96171163e-06,
        1.95760913e-06, 9.39309757e-07, 5.27385055e-07, 1.70658495e-07,
        7.65885488e-08, 2.93021003e-08, 2.42580427e-08, 1.35123628e-08,
        1.73458830e-08, 5.89716365e-09, 9.12405085e-09, 0.00000000e+00,
        4.60204319e-09, 0.00000000e+00, 4.80940798e-09, 0.00000000e+00,
        5.85350790e-09, 0.00000000e+00, 0.00000000e+00, 2.23008101e-09,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.06301741e-08,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 9.73886927e-09])}
In [16]:
data_hists
Out[16]:
{'GamGam:Data:data_A.GamGam': array([22., 24., 15., 13., 14., 17., 20., 15., 12., 13.,  7.,  7., 11.,
        18., 13., 11., 12., 12., 15., 11.,  9., 12.,  7.,  4.,  8., 10.,
         7., 14.,  8., 13.,  5.,  6.,  6.,  8.,  7.,  8.,  5.,  5.,  6.,
         8.,  5.,  5.,  8.,  8.,  7.,  6., 10.,  8.,  6.,  3.,  3.,  3.,
         5.,  1.]),
 'GamGam:Data:data_B.GamGam': array([84., 71., 59., 80., 65., 64., 73., 52., 53., 54., 48., 39., 51.,
        63., 32., 45., 38., 51., 51., 45., 40., 44., 30., 28., 35., 39.,
        31., 35., 28., 29., 42., 32., 23., 23., 33., 27., 23., 24., 22.,
        24., 19., 15., 19., 18., 17., 19., 17., 16., 17., 15., 15., 17.,
        11., 14.]),
 'GamGam:Data:data_C.GamGam': array([115., 123., 114.,  76.,  93.,  97.,  70.,  89.,  92.,  88.,  78.,
         81.,  93.,  64.,  61.,  90.,  62.,  54.,  58.,  60.,  62.,  47.,
         56.,  59.,  48.,  42.,  58.,  42.,  58.,  37.,  44.,  31.,  31.,
         33.,  46.,  35.,  33.,  28.,  29.,  28.,  28.,  23.,  38.,  38.,
         29.,  28.,  26.,  25.,  27.,  25.,  18.,  15.,  14.,  29.]),
 'GamGam:Data:data_D.GamGam': array([172., 161., 173., 173., 152., 148., 138., 142., 143., 117., 139.,
        132., 111., 115., 114., 128.,  86., 103., 104.,  99.,  85.,  93.,
         83.,  96.,  94.,  79.,  77.,  85.,  66.,  63.,  85.,  67.,  55.,
         62.,  71.,  59.,  67.,  58.,  61.,  51.,  35.,  63.,  41.,  50.,
         48.,  33.,  32.,  31.,  35.,  38.,  29.,  41.,  36.,  23.])}
In [17]:
data_hist = sum(data_hists.values())
In [18]:
mc_hist = sum(mc_hists.values())

We also define the data luminosity as $10.6~fb^{-1}$, stored below in $pb^{-1}$ to match the $1~pb^{-1}$ normalization of the MC histograms.

In [22]:
L = 1000*10.6
In [23]:
mc_hist
Out[23]:
array([ 8.43514183e-08,  8.43507019e-08,  3.28621601e-07,  1.30637231e-07,
        1.50498417e-07,  2.62837695e-07,  3.03653069e-07,  8.33247822e-07,
        9.28387645e-07,  1.40815936e-06,  3.33098834e-06,  8.63537449e-06,
        1.85798370e-05,  4.14119658e-05,  1.01452568e-04,  2.08742609e-04,
        4.05045080e-04,  6.92393804e-04,  1.01204429e-03,  1.21243621e-03,
        1.09503283e-03,  8.18722488e-04,  4.88938638e-04,  2.66967575e-04,
        1.38687550e-04,  6.55433849e-05,  2.75652261e-05,  1.25741506e-05,
        5.46395588e-06,  2.02198647e-06,  1.01422538e-06,  7.52746034e-07,
        7.66209585e-08,  3.13568917e-07,  2.91330803e-07,  1.80359951e-07,
        5.02950570e-08,  3.72399160e-08,  2.28884911e-08,  6.51198562e-09,
        7.05868786e-08,  7.43093551e-08, -1.01235855e-08,  2.23008101e-09,
       -6.55653309e-12,  1.42037493e-11,  2.72493139e-12,  2.05545803e-08,
        0.00000000e+00,  6.51745324e-12,  4.20641300e-12,  1.01537267e-08,
        1.37806637e-08,  9.73886927e-09])
In [24]:
data_hist
Out[24]:
array([393., 379., 361., 342., 324., 326., 301., 298., 300., 272., 272.,
       259., 266., 260., 220., 274., 198., 220., 228., 215., 196., 196.,
       176., 187., 185., 170., 173., 176., 160., 142., 176., 136., 115.,
       126., 157., 129., 128., 115., 118., 111.,  87., 106., 106., 114.,
       101.,  86.,  85.,  80.,  85.,  81.,  65.,  76.,  66.,  67.])
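
The plotting cell itself is not reproduced here; a minimal sketch of how the final figure can be drawn from the objects above (bin_edges, data_hist, mc_hist and the luminosity L), with the MC rescaled from $1~pb^{-1}$ to the full luminosity and an illustrative output filename:

import matplotlib.pyplot as plt

bin_centers = 0.5*(bin_edges[:-1] + bin_edges[1:])

fig, ax = plt.subplots()
ax.errorbar(bin_centers, data_hist, yerr=np.sqrt(data_hist), fmt="ko", label="Data")
ax.step(bin_centers, mc_hist*L, where="mid", label="Signal MC (scaled to 10.6 fb$^{-1}$)")
ax.set_xlabel(r"$m_{\gamma\gamma}$ [GeV]")
ax.set_ylabel("Events / bin")
ax.legend()
fig.savefig("output/higgs_diphoton_peak.png")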

Finally we arrive at the plot of the Higgs boson mass computed from the di-photon events, together with the data points. It turns out that most of the data statistics come from background, while the MC contains only signal datasets; that is why the data does not match the MC. Since the goal of this notebook was not to re-discover the Higgs, but to get acquainted with the data, the features and the relevant Python libraries, we wrap up here and move on to the next chapter. There we make use of the workflow learned in this chapter and set up the feature processing pipeline for the production mechanism classification problem.

In [ ]:
 
eda

Exploratory data analysis

The motivation behind the EDA chapter is to get a grasp of the data and of how informative some features are. The major goal is to drop completely uninteresting features, because we have plenty of them anyway.

For training the classifier we only need the MC data, because only the MC has labels.

In [3]:
datasets = {
    "GamGam": {
        "MC": [
            ("mc_341081.ttH125_gamgam.GamGam", {"tag": "tt"}),
            ("mc_343981.ggH125_gamgam.GamGam", {"tag": "gg"}),
            ("mc_345041.VBFH125_gamgam.GamGam", {"tag": "VBF"}),
            ("mc_345318.WpH125J_Wincl_gamgam.GamGam", {"tag": "WH"}),
            ("mc_345319.ZH125J_Zincl_gamgam.GamGam", {"tag": "ZH"})

        ]
    }
}

We begin with the full list of features present in the ATLAS Open Data ROOT files:

Out[6]:
[b'runNumber',
 b'eventNumber',
 b'channelNumber',
 b'mcWeight',
 b'scaleFactor_PILEUP',
 b'scaleFactor_ELE',
 b'scaleFactor_MUON',
 b'scaleFactor_PHOTON',
 b'scaleFactor_TAU',
 b'scaleFactor_BTAG',
 b'scaleFactor_LepTRIGGER',
 b'scaleFactor_PhotonTRIGGER',
 b'trigE',
 b'trigM',
 b'trigP',
 b'lep_n',
 b'lep_truthMatched',
 b'lep_trigMatched',
 b'lep_pt',
 b'lep_eta',
 b'lep_phi',
 b'lep_E',
 b'lep_z0',
 b'lep_charge',
 b'lep_type',
 b'lep_isTightID',
 b'lep_ptcone30',
 b'lep_etcone20',
 b'lep_trackd0pvunbiased',
 b'lep_tracksigd0pvunbiased',
 b'met_et',
 b'met_phi',
 b'jet_n',
 b'jet_pt',
 b'jet_eta',
 b'jet_phi',
 b'jet_E',
 b'jet_jvt',
 b'jet_trueflav',
 b'jet_truthMatched',
 b'jet_MV2c10',
 b'photon_n',
 b'photon_truthMatched',
 b'photon_trigMatched',
 b'photon_pt',
 b'photon_eta',
 b'photon_phi',
 b'photon_E',
 b'photon_isTightID',
 b'photon_ptcone30',
 b'photon_etcone20',
 b'photon_convType',
 b'tau_n',
 b'tau_pt',
 b'tau_eta',
 b'tau_phi',
 b'tau_E',
 b'tau_isTightID',
 b'tau_truthMatched',
 b'tau_trigMatched',
 b'tau_nTracks',
 b'tau_BDTid',
 b'ditau_m',
 b'lep_pt_syst',
 b'met_et_syst',
 b'jet_pt_syst',
 b'photon_pt_syst',
 b'tau_pt_syst',
 b'XSection',
 b'SumWeights',
 b'largeRjet_n',
 b'largeRjet_pt',
 b'largeRjet_eta',
 b'largeRjet_phi',
 b'largeRjet_E',
 b'largeRjet_m',
 b'largeRjet_truthMatched',
 b'largeRjet_D2',
 b'largeRjet_tau32',
 b'largeRjet_pt_syst',
 b'tau_charge']

We fetch the first 100000 events from each file and measure the time spent fetching each batch. After the traversal finishes, we end up with a dict of 100000 events per file, keyed by the file tag.

In [12]:
events = events_per_file(datasets)
Process:  GamGam
Type:  MC
File:  mc_341081.ttH125_gamgam.GamGam
Total Num Events:  576491
Processing:  100000 ; Time per batch:  16.59511667722836
File:  mc_343981.ggH125_gamgam.GamGam
Total Num Events:  1054711
Processing:  100000 ; Time per batch:  4.95306570827961
File:  mc_345041.VBFH125_gamgam.GamGam
Total Num Events:  497468
Processing:  100000 ; Time per batch:  3.0408954811282456
File:  mc_345318.WpH125J_Wincl_gamgam.GamGam
Total Num Events:  113765
Processing:  100000 ; Time per batch:  1.790856369305402
File:  mc_345319.ZH125J_Zincl_gamgam.GamGam
Total Num Events:  230900
Processing:  100000 ; Time per batch:  2.887145197018981
In [13]:
events.keys()
Out[13]:
dict_keys(['tt', 'gg', 'VBF', 'WH', 'ZH'])

We begin by plotting the number of particles of each kind per source. We see that:

  • the jet multiplicity varies over a wide range (0-15), with a spread that depends on the process
  • there are usually two photons, rarely 3 or more
  • a lepton count > 0 occurs mostly for tt and WH events, rarely for the others
  • tau and large-R jet activity is too rare to be taken into account

Some features have nearly the same value for most events regardless of the origin, which makes them useless for discrimination. An example of such a feature is jet_jvt.
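
As a hedged sketch of how such near-constant features could be flagged automatically (hypothetical helper; for jagged per-particle branches the values would first need to be flattened):

def mostly_constant(values, threshold=0.99):
    # True if a single value accounts for more than `threshold` of all entries
    flat = np.asarray(values).ravel()
    _, counts = np.unique(flat, return_counts=True)
    return counts.max()/flat.size > threshold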

/afs/cern.ch/user/a/ananiev/projects/venvs/sandbox/lib64/python3.6/site-packages/numpy/lib/histograms.py:839: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
/afs/cern.ch/user/a/ananiev/projects/venvs/sandbox/lib64/python3.6/site-packages/numpy/lib/histograms.py:840: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
Out[16]:

The feature jet_MV2c10 is the output of the b-tagging algorithm applied to the jets in the event. As we can see from the plots, it can serve as a good signature for tt events.

To confirm that the tau features are irrelevant, let's plot their values per production mechanism:

Out[18]:
In [48]:
def plot_met_et(ax, key, events):
    ax.hist(np.log(events[b"met_et"]), weights=events[b"mcWeight"], label=f"{key}_met_et", bins=25)
    ax.legend()
    ax.set_title(key)
per_axis(plot_met_et, events).savefig("output/met_et_numbers.png")
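The helper per_axis used here is defined earlier in the notebook; a hypothetical reimplementation of its assumed behaviour (one subplot per production mechanism, the plotting callback receiving the axis, the process tag and its event batch):

import numpy as np
import matplotlib.pyplot as plt

def per_axis(plot_fn, events):
    # one subplot per process tag; plot_fn draws into the given axis
    keys = list(events.keys())
    fig, axes = plt.subplots(1, len(keys), figsize=(5*len(keys), 4))
    for ax, key in zip(np.atleast_1d(axes), keys):
        plot_fn(ax, key, events[key])
    fig.tight_layout()
    return fig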
In [45]:
def plot_met_phi(ax, key, events):
    v, _, _ = ax.hist(events[b"met_phi"], weights=events[b"mcWeight"], label=f"{key}_met_phi", bins=20)
    ax.legend()
    ax.set_ylim([np.min(v)-50, np.max(v)+100])
    ax.set_title(key)
per_axis(plot_met_phi, events).savefig("output/met_phi_numbers.png")
In [21]:
def hist_masked(ax, data, weights=None, **kwargs):
    # np.isfinite drops NaN/inf entries (note: the naive comparison `data != np.nan` is always True)
    mask = np.isfinite(data)
    return ax.hist(data[mask], weights=weights[mask] if weights is not None else None, **kwargs)
In [28]:
def plot_pt_values(ax, key, events):
    _, bins, _ = hist_masked(ax, np.log10(events[b"jet_pt"].max()), weights=events[b"mcWeight"], label=f"{key}_jet_pt", alpha=0.7, bins=20)
    hist_masked(ax, np.log10(events[b"lep_pt"].max()), weights=events[b"mcWeight"], label=f"{key}_lep_pt", alpha=0.7, bins=bins)
    hist_masked(ax, np.log10(events[b"photon_pt"].max()), weights=events[b"mcWeight"], label=f"{key}_photon_pt", alpha=0.7, bins=bins)
    hist_masked(ax, np.log10(events[b"tau_pt"].max()), weights=events[b"mcWeight"], label=f"{key}_tau_pt", alpha=0.7, bins=bins)
    hist_masked(ax, np.log10(events[b"largeRjet_pt"].max()), weights=events[b"mcWeight"], label=f"{key}_largeRjet_pt", alpha=0.7, bins=bins)
    ax.legend()
    ax.set_title(key)
per_axis(plot_pt_values, events).savefig("output/pt_number.png")
/afs/cern.ch/user/a/ananiev/projects/venvs/sandbox/lib64/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in log10
/afs/cern.ch/user/a/ananiev/projects/venvs/sandbox/lib64/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in log10
/afs/cern.ch/user/a/ananiev/projects/venvs/sandbox/lib64/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: invalid value encountered in log10
/afs/cern.ch/user/a/ananiev/projects/venvs/sandbox/lib64/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in log10
In [ ]:
 
feature_processing

Feature processing for production mechanism classification

Di-photon event selection based on https://arxiv.org/abs/1802.04146

We keep working with Monte Carlo simulations, since only they have labels.

In [3]:
datasets = {
    "GamGam": {
        "MC": [
            ("mc_341081.ttH125_gamgam.GamGam", {"tag": "tt"}),
            ("mc_343981.ggH125_gamgam.GamGam", {"tag": "gg"}),
            ("mc_345041.VBFH125_gamgam.GamGam", {"tag": "VBF"}),
            ("mc_345318.WpH125J_Wincl_gamgam.GamGam", {"tag": "Wp"}),
            ("mc_345319.ZH125J_Zincl_gamgam.GamGam", {"tag": "Z"})

        ]
    }
}

We still follow the "macro" / "micro" feature classification, where "micro" features describe the particles within the event (pT of the 1st, 2nd, 3rd photon, ...). We extract features from the event batch in groups per particle type, beginning with photons. Basically we repeat the cuts applied in the Higgs di-photon analysis, but here we use a narrower window around the Higgs mass and we do not apply the rapidity cuts. Since we train on MC, we do not need to check whether the events fall into the detector acceptance.

In [10]:
def process_photons(events, mask):
    macro_mask = mask.copy()
    
    macro_events = {}
    micro_events = {}
    
    macro_events[b"photon_n"] = events[b"photon_n"][macro_mask]
    macro_events[b"trigP"] = events[b"trigP"][macro_mask]
    # require at least two photons and a fired photon trigger
    n_threshold = macro_events[b"photon_n"] >= 2
    is_diphoton = macro_events[b"trigP"]
    macro_mask[macro_mask] = n_threshold & is_diphoton
    
    micro_events[b"photon_pt"] = events[b"photon_pt"][macro_mask]
    micro_events[b"photon_eta"] = events[b"photon_eta"][macro_mask]
    micro_events[b"photon_phi"] = events[b"photon_phi"][macro_mask]
    micro_events[b"photon_E"] = events[b"photon_E"][macro_mask]
    micro_events[b"photon_isTightID"] = events[b"photon_isTightID"][macro_mask]
    micro_events[b"photon_trigMatched"] = events[b"photon_trigMatched"][macro_mask]
    micro_events[b"photon_ptcone30"] = events[b"photon_ptcone30"][macro_mask]
    micro_events[b"photon_etcone20"] = events[b"photon_etcone20"][macro_mask]
    
    
    pts = micro_events[b"photon_pt"].argsort(ascending=False)
    row_indices = np.arange(pts.shape[0])
    lead_pts = pts[:, 0]
    sublead_pts = pts[:, 1]
    
    
    macro_events[b"photon_n"] = macro_events[b"photon_n"][macro_mask]
    macro_events[b"photon_1lead_pt"] = micro_events[b"photon_pt"][row_indices, lead_pts]
    macro_events[b"photon_1lead_eta"] = micro_events[b"photon_eta"][row_indices, lead_pts]
    macro_events[b"photon_1lead_phi"] = micro_events[b"photon_phi"][row_indices, lead_pts]
    macro_events[b"photon_1lead_E"] = micro_events[b"photon_E"][row_indices, lead_pts]
    macro_events[b"photon_1lead_isTightID"] = micro_events[b"photon_isTightID"][row_indices, lead_pts]
    macro_events[b"photon_1lead_trigMatched"] = micro_events[b"photon_trigMatched"][row_indices, lead_pts]
    macro_events[b"photon_1lead_ptcone30"] = micro_events[b"photon_ptcone30"][row_indices, lead_pts]
    macro_events[b"photon_1lead_etcone20"] = micro_events[b"photon_etcone20"][row_indices, lead_pts]
    macro_events[b"photon_2lead_pt"] = micro_events[b"photon_pt"][row_indices, sublead_pts]
    macro_events[b"photon_2lead_eta"] = micro_events[b"photon_eta"][row_indices, sublead_pts]
    macro_events[b"photon_2lead_phi"] = micro_events[b"photon_phi"][row_indices, sublead_pts]
    macro_events[b"photon_2lead_E"] = micro_events[b"photon_E"][row_indices, sublead_pts]
    macro_events[b"photon_2lead_isTightID"] = micro_events[b"photon_isTightID"][row_indices, sublead_pts]
    macro_events[b"photon_2lead_trigMatched"] = micro_events[b"photon_trigMatched"][row_indices, sublead_pts]
    macro_events[b"photon_2lead_ptcone30"] = micro_events[b"photon_ptcone30"][row_indices, sublead_pts]
    macro_events[b"photon_2lead_etcone20"] = micro_events[b"photon_etcone20"][row_indices, sublead_pts]
    
    macro_filter = (  (macro_events[b"photon_1lead_pt"] > 25000)
                    &
                      (macro_events[b"photon_2lead_pt"] > 25000)
                    &
                      (macro_events[b"photon_1lead_isTightID"])
                    &
                      (macro_events[b"photon_2lead_isTightID"])
                    &
                      (macro_events[b"photon_1lead_trigMatched"])
                    &
                      (macro_events[b"photon_2lead_trigMatched"])
                    & 
                      (macro_events[b"photon_1lead_ptcone30"] < 0.065)
                    & 
                      (macro_events[b"photon_1lead_etcone20"] < 0.065)
                    & 
                      (macro_events[b"photon_2lead_ptcone30"] < 0.065)
                    & 
                      (macro_events[b"photon_2lead_etcone20"] < 0.065)
                   )
    
    dict_apply_mask(macro_events, macro_filter)
    macro_mask[macro_mask] = macro_filter

    macro_events[b"h_mass"] = np.sqrt(2.*macro_events[b"photon_1lead_E"]
                                        *macro_events[b"photon_2lead_E"]
                                        *(1. - atlas_two_cosine(macro_events, b"photon_1lead", b"photon_2lead"))
                                     )
    
    mass_cutoff =   (macro_events[b"photon_1lead_E"]/macro_events[b"h_mass"] > 0.35) \
                  & (macro_events[b"photon_2lead_E"]/macro_events[b"h_mass"] > 0.25) \
                  & (macro_events[b"h_mass"] >= 115000.) \
                  & (macro_events[b"h_mass"] <= 135000)
    
    dict_apply_mask(macro_events, mass_cutoff)
    macro_mask[macro_mask] = mass_cutoff
    
    del macro_events[b"photon_1lead_isTightID"]
    del macro_events[b"photon_2lead_isTightID"]
    del macro_events[b"photon_1lead_trigMatched"]
    del macro_events[b"photon_2lead_trigMatched"]
    del macro_events[b"photon_1lead_ptcone30"]  # both null
    del macro_events[b"photon_2lead_ptcone30"]
    del macro_events[b"trigP"]
    
    return macro_events, macro_mask
#process_photons(test_mc_events, np.ones_like(test_mc_events[b"photon_n"], dtype=np.bool))

Weight extraction exactly repeats the di-photon analysis.

In [11]:
def process_weights(events, mask):
    total_weights = events[b"SumWeights"][0]
    x_section = events[b"XSection"][0]
    weights = (  events[b"mcWeight"]
               * events[b'scaleFactor_PILEUP'] 
               * events[b'scaleFactor_ELE'] 
               * events[b'scaleFactor_MUON'] 
               * events[b'scaleFactor_PHOTON'] 
               * events[b'scaleFactor_TAU'] 
               * events[b'scaleFactor_BTAG'] 
               * events[b'scaleFactor_LepTRIGGER'] 
               * events[b'scaleFactor_PhotonTRIGGER']
              )[mask]/total_weights*x_section
    return {b"weight": weights}, mask
#process_weights(test_mc_events, np.ones_like(test_mc_events[b"photon_n"], dtype=np.bool))
In [12]:
def extract_descriptive(events, field):
    # per-event summary statistics (min/max/mean/sum/std) of a jagged micro feature
    values = events[field]
    return {
        field + b"_min": values.min(),
        field + b"_max": values.max(),
        field + b"_mean": values.mean(),
        field + b"_sum": values.sum(),
        field + b"_std": values.std()
    }

For jets and leptons, in addition to their macro features we also extract descriptive features: min, max, mean, sum and std computed on the micro features, for instance the maximum pt of the leptons within an event. In this way we avoid the problem of a variable number of features, which would otherwise become an issue for the ML algorithms. There are several ways to tackle that problem; while we follow only one of them, knowing the alternatives may be helpful when developing more advanced solutions.

Options for handling a variable number of features:

  • Feature aggregation (the one we follow, due to its simplicity).
  • Data imputation: assume every event has, say, 15 leptons and fill the gaps with the mean or with draws from the distribution when the actual number of leptons is smaller (see the sketch after this list).
  • Set kernels, which compute a "distance" between sets independently of the order of their elements and allow sets of different sizes to be compared. A famous example is the Pyramid Match kernel, which compares overlaps between histograms built from the feature sets.
  • Neural nets that embed sets of variable size into a fixed-dimensional space, for example DeepSets.
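
As an illustration of the imputation option, a minimal sketch that pads every event to a fixed number of lepton slots (MAX_LEP and the zero fill value are assumptions for illustration, not what this notebook does):

MAX_LEP = 15  # assumed fixed number of lepton slots

def pad_lepton_pts(lep_pt_per_event, fill_value=0.0):
    # turn a variable-length list of lepton pts per event into a fixed-width matrix
    out = np.full((len(lep_pt_per_event), MAX_LEP), fill_value)
    for i, pts in enumerate(lep_pt_per_event):
        n = min(len(pts), MAX_LEP)
        out[i, :n] = pts[:n]
    return out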
In [13]:
def process_jets(events, mask):
    macro_mask = mask.copy()
    
    macro_events = {}
    micro_events = {}
    
    macro_events[b"jet_n"] = events[b"jet_n"][macro_mask]
    
    micro_events[b"jet_pt"] = events[b"jet_pt"][macro_mask]
    micro_events[b"jet_theta"] = np.arctan(eta2tg_theta(events[b"jet_eta"][macro_mask]))
    micro_events[b"jet_phi"] = events[b"jet_phi"][macro_mask]
    micro_events[b"jet_E"] = events[b"jet_E"][macro_mask]
    micro_events[b"jet_MV2c10"] = events[b"jet_MV2c10"][macro_mask]
    
    macro_events.update(extract_descriptive(micro_events, b"jet_pt"))
    macro_events.update(extract_descriptive(micro_events, b"jet_phi"))
    macro_events.update(extract_descriptive(micro_events, b"jet_E"))
    macro_events.update(extract_descriptive(micro_events, b"jet_theta"))
    macro_events.update(extract_descriptive(micro_events, b"jet_MV2c10"))
    
    return macro_events, macro_mask
#process_jets(test_mc_events, np.ones_like(test_mc_events[b"jet_n"], dtype=np.bool))
In [14]:
def process_lep(events, mask):
    macro_mask = mask.copy()
    
    macro_events = {}
    micro_events = {}
    
    macro_events[b"lep_n"] = events[b"lep_n"][macro_mask]
    
    micro_events[b"lep_pt"] = events[b"lep_pt"][macro_mask]
    micro_events[b"lep_theta"] = np.arctan(eta2tg_theta(events[b"lep_eta"][macro_mask]))
    micro_events[b"lep_phi"] = events[b"lep_phi"][macro_mask]
    micro_events[b"lep_E"] = events[b"lep_E"][macro_mask]
    micro_events[b"lep_z0"] = events[b"lep_z0"][macro_mask]
    micro_events[b"lep_charge"] = events[b"lep_charge"][macro_mask]
    micro_events[b"lep_ptcone30"] = events[b"lep_ptcone30"][macro_mask]
    micro_events[b"lep_etcone20"] = events[b"lep_etcone20"][macro_mask]
    
    macro_events.update(extract_descriptive(micro_events, b"lep_pt"))
    macro_events.update(extract_descriptive(micro_events, b"lep_phi"))
    macro_events.update(extract_descriptive(micro_events, b"lep_E"))
    macro_events.update(extract_descriptive(micro_events, b"lep_theta"))
    macro_events.update(extract_descriptive(micro_events, b"lep_charge"))
    macro_events.update(extract_descriptive(micro_events, b"lep_z0"))
    macro_events.update(extract_descriptive(micro_events, b"lep_ptcone30"))
    macro_events.update(extract_descriptive(micro_events, b"lep_etcone20"))
    
    return macro_events, macro_mask
#process_lep(test_mc_events, np.ones_like(test_mc_events[b"lep_n"], dtype=np.bool))

Finally, we apply all the feature extraction functions above to the event batches and store the outputs from all the MC files in the same feature file. The production mechanism is stored in a separate column as a textual label for each row. We spawn a thread per MC file; since most of the time is spent on I/O operations, the threads do not introduce significant overhead.
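
For orientation, a minimal sketch of how these processors could be chained for a single batch (the flattening into TSV rows and the per-file thread handling inside events_per_file are omitted; the function name here is illustrative):

def process_batch(events):
    mask = np.ones_like(events[b"photon_n"], dtype=bool)
    features = {}
    # the photon cuts shrink the event mask that all later feature groups reuse
    photon_features, mask = process_photons(events, mask)
    features.update(photon_features)
    for processor in (process_weights, process_jets, process_lep):
        group_features, _ = processor(events, mask)
        features.update(group_features)
    return features  # dict of equal-length arrays, one entry per selected event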

In [20]:
events_per_file(datasets, os.path.join(output_path, "hgg_features.tsv"))
Process:  GamGam
Type:  MC
File:  mc_341081.ttH125_gamgam.GamGam
root://eosuser.cern.ch//eos/user/a/ananiev/data/GamGam/MC/mc_341081.ttH125_gamgam.GamGam.root
File:  mc_343981.ggH125_gamgam.GamGam
root://eosuser.cern.ch//eos/user/a/ananiev/data/GamGam/MC/mc_343981.ggH125_gamgam.GamGam.root
File:  mc_345041.VBFH125_gamgam.GamGam
root://eosuser.cern.ch//eos/user/a/ananiev/data/GamGam/MC/mc_345041.VBFH125_gamgam.GamGam.root
File:  mc_345318.WpH125J_Wincl_gamgam.GamGam
root://eosuser.cern.ch//eos/user/a/ananiev/data/GamGam/MC/mc_345318.WpH125J_Wincl_gamgam.GamGam.root
File:  mc_345319.ZH125J_Zincl_gamgam.GamGam
root://eosuser.cern.ch//eos/user/a/ananiev/data/GamGam/MC/mc_345319.ZH125J_Zincl_gamgam.GamGam.root
Wp Processing: 100000

Z Processing: 100000

gg Processing: 100000

tt Processing: 100000

VBF Processing: 100000

Wp Processing: 200000

Z Processing: 200000

gg Processing: 200000

Z Processing: 300000

tt Processing: 200000

VBF Processing: 200000

gg Processing: 300000

tt Processing: 300000

gg Processing: 400000

VBF Processing: 300000

tt Processing: 400000

VBF Processing: 400000

gg Processing: 500000

tt Processing: 500000

VBF Processing: 500000

gg Processing: 600000

tt Processing: 600000

gg Processing: 700000

gg Processing: 800000

gg Processing: 900000

gg Processing: 1000000

gg Processing: 1100000

Done! Time spent:  64.71421001688577
modelling

Modelling

In this chapter we train a few simple classifiers, discover issues specific to the dataset, and present several ways to address them in future model development.

In [4]:
RANDOM_STATE = 23

Descriptive statistics

First of all, let's look at the event counts and at how the weights are distributed among these events:

  • event_num — number of events per process
  • weight_sums — sum of weights per process
  • weight_frac — fraction of weights per process normalized to the total weight of all events
  • weight_per_event — average weight of event within the process
Out[5]:
event_num weight_sums weight_frac weight_per_event
VBF 61021 5.649031e-04 0.080207 9.257520e-09
Wp 12905 1.033396e-04 0.014673 8.007720e-09
Z 26377 9.719725e-05 0.013800 3.684924e-09
gg 121078 6.277415e-03 0.891295 5.184604e-08
tt 59236 1.730808e-07 0.000025 2.921885e-12

To summarize:

  • tt events are fairly numerous, but their weights are tiny
  • Wp and Z events are few, but their per-event weights are significant
  • VBF events are quite numerous and their weights are moderate
  • gg events are the decisive majority and carry most of the total weight

We also list the features we can use during training:

Out[6]:
Index(['photon_n', 'photon_1lead_pt', 'photon_1lead_eta', 'photon_1lead_phi',
       'photon_1lead_E', 'photon_1lead_etcone20', 'photon_2lead_pt',
       'photon_2lead_eta', 'photon_2lead_phi', 'photon_2lead_E',
       'photon_2lead_etcone20', 'h_mass', 'lep_n', 'lep_pt_min', 'lep_pt_max',
       'lep_pt_mean', 'lep_pt_sum', 'lep_pt_std', 'lep_phi_min', 'lep_phi_max',
       'lep_phi_mean', 'lep_phi_sum', 'lep_phi_std', 'lep_E_min', 'lep_E_max',
       'lep_E_mean', 'lep_E_sum', 'lep_E_std', 'lep_theta_min',
       'lep_theta_max', 'lep_theta_mean', 'lep_theta_sum', 'lep_theta_std',
       'lep_charge_min', 'lep_charge_max', 'lep_charge_mean', 'lep_charge_sum',
       'lep_charge_std', 'lep_z0_min', 'lep_z0_max', 'lep_z0_mean',
       'lep_z0_sum', 'lep_z0_std', 'lep_ptcone30_min', 'lep_ptcone30_max',
       'lep_ptcone30_mean', 'lep_ptcone30_sum', 'lep_ptcone30_std',
       'lep_etcone20_min', 'lep_etcone20_max', 'lep_etcone20_mean',
       'lep_etcone20_sum', 'lep_etcone20_std', 'weight', 'met_et', 'met_phi',
       'label'],
      dtype='object')

Let's train a baseline model with an ensemble of trees in order to identify the features that are at least somewhat relevant to the analysis.

We split the dataset into training, validation and test samples. The test sample constitutes 20% of the data, the validation sample 8% (10% of the remainder), and the remaining 72% is used for training.

In [8]:
d_train, d_val, d_test = train_test_val_split(data, "label", 0.2, 0.1, shuffle=True, random_state=RANDOM_STATE)

Initially we choose features describing the event in general (e.g. photon number, lepton number, missing energy). Then, since the two photons are the defining objects of the event, we pick the two most energetic photons and include all the features describing them in detail (e.g. eta, phi, pt, E). Next we include aggregated features for leptons, like the sum of phi, the sum of energies, the sum of theta. Finally, we include h_mass.

Sometimes there are no leptons in the event; in that case we set all the lepton features to zero.

Find the full feature list below.

In [12]:
BaselineFeaturePipeline = \
    pipeline_maker(Pipeline, steps=[
        ('keep_features', KeepFeatures(['photon_n', 'photon_1lead_pt',
                                        'photon_1lead_eta', 'photon_1lead_phi',
                                        'photon_1lead_E', 'photon_1lead_etcone20',
                                        'photon_2lead_pt', 'photon_2lead_eta',
                                        'photon_2lead_phi', 'photon_2lead_E',
                                        'photon_2lead_etcone20', 'h_mass',
                                        'met_et', 'met_phi', 'lep_pt_sum',
                                        'lep_phi_mean', 'lep_phi_std', 'lep_E_sum',
                                        'lep_theta_mean', 'lep_theta_std',
                                        'lep_charge_sum', 'lep_ptcone30_sum',
                                        'lep_etcone20_sum', 'label', 'weight'])),
        ('filna', FillNa(0.))
    ])
baseline_feature_pipeline = BaselineFeaturePipeline().fit(d_train)
baseline_features = baseline_feature_pipeline.transform(d_train)
baseline_features
Out[12]:
photon_n photon_1lead_pt photon_1lead_eta photon_1lead_phi photon_1lead_E photon_1lead_etcone20 photon_2lead_pt photon_2lead_eta photon_2lead_phi photon_2lead_E ... lep_phi_mean lep_phi_std lep_E_sum lep_theta_mean lep_theta_std lep_charge_sum lep_ptcone30_sum lep_etcone20_sum label weight
268590 2 58372.110 0.481853 1.889242 65280.734 -956.71640 48145.110 1.724146 -1.356490 139285.110 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 gg 3.958069e-08
43080 2 229317.270 0.762101 1.533991 299196.900 -1804.33690 85816.920 0.511017 0.655814 97267.940 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 tt -4.200527e-12
249823 2 76091.230 -0.271051 2.498819 78903.540 -240.93822 51134.625 -0.885657 -0.948343 72535.000 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 gg 6.397519e-08
207150 2 104560.340 -0.122680 -0.021189 105348.164 -262.15533 49661.210 0.171808 1.953999 50395.965 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 tt 6.542525e-12
208366 2 60886.160 -0.230575 1.790319 62511.840 -417.62520 44209.543 -2.280501 1.775367 218479.450 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 tt 7.398494e-12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54183 2 79976.550 -0.453348 0.882136 88336.860 -520.47560 56892.867 -0.492723 -3.073204 63939.830 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 VBF 1.164016e-08
11277 2 66324.640 1.054671 -2.999878 106760.660 -914.89844 51873.164 0.489214 -0.062816 58205.375 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 Wp 5.586333e-09
266476 2 98814.510 -0.318616 0.287074 103872.700 -550.41810 58320.203 -0.143005 -1.677720 58917.555 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 gg 6.580980e-08
124031 2 114924.820 -1.277359 -1.705627 222145.450 -465.48880 32608.330 -2.213135 0.601934 150873.860 ... -0.540576 1.94015 254815.25 -0.985533 0.155805 0 0.0 -651.01685 tt 6.348159e-12
185056 2 55735.242 1.552978 -1.803264 137586.330 -224.20319 53927.145 0.644007 1.511915 65502.027 ... 0.000000 0.00000 0.00 0.000000 0.000000 0 0.0 0.00000 gg 6.623756e-08

202043 rows × 25 columns
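
KeepFeatures and FillNa are small custom transformers from the project repository; a hypothetical sketch of their assumed behaviour as scikit-learn-style transformers operating on pandas DataFrames:

from sklearn.base import BaseEstimator, TransformerMixin

class KeepFeatures(BaseEstimator, TransformerMixin):
    # keep only the listed columns
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns]

class FillNa(BaseEstimator, TransformerMixin):
    # replace missing values (e.g. lepton features for events with no leptons) with a constant
    def __init__(self, value):
        self.value = value
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.fillna(self.value)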

As a baseline model we pick RandomForestClassifier. A random forest is useful for picking out relevant features even when the dataset includes correlated features: the correlations are weakened because each tree is trained on a random subset of the features.

In [14]:
baseline_model = RandomForestClassifier(random_state=RANDOM_STATE).fit(*split_targets(baseline_features, "label", drop=["weight"]), sample_weight=baseline_features["weight"])
In [21]:
def save_figures(name, figures):
    for label, fig in figures.items():
        fig.savefig(os.path.join("output/", f"{name}_{label}.png"))

After the model has been trained, let's look at the report describing the baseline.

We have 5 classes and a confusion matrix for them.

We also provide a confusion matrix for the weights in a somewhat sophisticated format. Because the weights can differ by orders of magnitude, it is easier to compare their logarithms; since all the weights are < 1, it is convenient to drop the minus sign of the numbers in the matrix. As a result, the smaller the numbers on the diagonal of the matrix, the better.

Finally, we show the feature importances coming from the random forest classifier.
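
The report function itself lives in the repo; as a hedged sketch of just the weight confusion matrix described above (hypothetical helper, assuming per-event true labels, predictions and weights are at hand):

from sklearn.metrics import confusion_matrix

def log_weight_confusion(y_true, y_pred, weights, labels):
    # confusion matrix of summed weights, displayed as -log10(weight);
    # since all weights are < 1, smaller numbers on the diagonal are better
    cm = confusion_matrix(y_true, y_pred, labels=labels, sample_weight=weights)
    return -np.log10(np.where(cm > 0, cm, np.nan))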

In [22]:
save_figures(
    "random-forest",
    report(baseline_model, baseline_feature_pipeline, baseline_feature_pipeline.transform(d_test), drop=["weight"])
)

Apparently there are so many gg events that other events get classified as gg simply for statistical reasons. This means the dataset is imbalanced, and we need to find an approach to address this issue.

We also choose a subset of features to work with, keeping only the significantly relevant ones. See the list below.

Out[25]:
['photon_1lead_pt',
 'met_phi',
 'photon_2lead_pt',
 'met_et',
 'h_mass',
 'photon_1lead_E',
 'photon_2lead_phi',
 'photon_1lead_phi',
 'photon_2lead_eta',
 'photon_1lead_eta',
 'photon_2lead_etcone20',
 'photon_2lead_E',
 'photon_1lead_etcone20']

The feature processing for the next model, which we name the significant imbalanced model, is a bit more complex. As before, we fill empty values of the chosen features with zeros.

We also drop the tt, Wp and Z events at this step, trying to focus on events that have a similar weight per event: Wp and Z events are few but very heavy, while tt events are numerous but their weights are tiny, and such a mixture is hard to handle together.

We end up with gg and VBF events. The dataset is still imbalanced in terms of event counts, but we need to worry less about the weights.

As part of the feature processing we also drop events with negative weights. This is purely a technical step, because the algorithm we are going to apply cannot handle negative weights.

In [30]:
SignificantFeaturePipeline = \
    pipeline_maker(Pipeline, steps=[
        ('keep_features', KeepFeatures(subcolumns + ["label", "weight"])),
        ('filna', FillNa(0.)),
        ('drop_tt', DropByValues("label", ["tt", "Wp", "Z"])),
        ('drop_neg_weight', DropByLevel(["weight"], threshold=0)),
        ('labels', WithColumns(LabelEncoder(), columns="label"))
    ])
significant_feature_pipeline = SignificantFeaturePipeline().fit(d_train)
significant_features = significant_feature_pipeline.transform(d_train)
significant_features
Out[30]:
photon_1lead_pt met_phi photon_2lead_pt met_et h_mass photon_1lead_E photon_2lead_phi photon_1lead_phi photon_2lead_eta photon_1lead_eta photon_2lead_etcone20 photon_2lead_E photon_1lead_etcone20 label weight
268590 58372.110 0.600789 48145.110 30176.668 127024.91 65280.734 -1.356490 1.889242 1.724146 0.481853 -467.942500 139285.110 -956.716400 1 3.958069e-08
249823 76091.230 -0.327938 51134.625 28954.781 129304.70 78903.540 -0.948343 2.498819 -0.885657 -0.271051 -916.186650 72535.000 -240.938220 1 6.397519e-08
117105 98038.700 1.261602 41276.086 20532.875 130812.99 270122.900 -3.044603 -3.093076 0.129961 -1.671973 -325.605040 41625.150 -61.689865 1 1.902657e-08
260493 36145.746 1.109834 31157.201 20445.281 125915.72 67071.990 -2.424861 0.340449 -1.266889 1.229256 -433.397430 59689.400 -46.066826 1 6.466785e-08
100602 69329.410 -0.884942 66717.430 10639.224 122480.49 171774.700 1.492158 -2.875040 -2.292921 -1.556997 -31.067635 333747.000 -381.454530 0 6.241101e-09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
245030 114924.770 1.792097 38737.754 131497.720 126005.20 115701.280 1.301682 -0.950417 -0.663213 -0.116181 -1386.085100 47574.062 -273.951720 1 6.292844e-08
58024 142275.270 2.380657 82808.560 38783.910 125895.39 152268.330 -1.474372 -0.448284 -0.980974 -0.372640 -1615.108500 125951.720 -200.366040 0 1.106634e-08
54183 79976.550 2.483016 56892.867 8442.810 123923.63 88336.860 -3.073204 0.882136 -0.492723 -0.453348 -91.334460 63939.830 -520.475600 0 1.164016e-08
266476 98814.510 2.241106 58320.203 70002.210 126997.98 103872.700 -1.677720 0.287074 -0.143005 -0.318616 -1023.170600 58917.555 -550.418100 1 6.580980e-08
185056 55735.242 0.696735 53927.145 24582.861 120794.63 137586.330 1.511915 -1.803264 0.644007 1.552978 -623.060300 65502.027 -224.203190 1 6.623756e-08

128378 rows × 15 columns

Following the literature on bagging for imbalanced datasets, we focus on the Easy Ensemble model. It is an ensemble in which AdaBoost (a variant of a BDT) is used as the base estimator, and the majority class is downsampled in order to balance the training sample.

To avoid throwing data away, we train 6 different AdaBoosts on different samples: the minority class is always included in full, while the majority class is downsampled to match the number of events in the minority class.

Once the 6 base estimators have been trained, a voting procedure decides on the prediction for each test event.

* Note: we had to monkey-patch the AdaBoostClassifier to allow the EasyEnsemble implementation to use sample weights. See the details in the full version of the notebook in the github repo.
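
As an illustration of what such a weight-aware AdaBoost wrapper could look like (a sketch of the idea only; the project's actual WeightedAdaBoost lives in the repo and may differ), the event weight is carried along as an extra feature column so that it survives EasyEnsemble's internal resampling, then popped off and passed to AdaBoost as sample_weight inside fit:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

class WeightedAdaBoost(AdaBoostClassifier):
    """AdaBoost that reads per-event weights from a designated feature column."""

    def __init__(self, weights_column=None, n_estimators=50, learning_rate=1.0, random_state=None):
        super().__init__(n_estimators=n_estimators, learning_rate=learning_rate,
                         random_state=random_state)
        self.weights_column = weights_column

    def _strip(self, X):
        # remove the weight column before the data reaches the trees
        X = np.asarray(X)
        if self.weights_column is None:
            return X
        return np.delete(X, self.weights_column, axis=1)

    def fit(self, X, y, sample_weight=None):
        X = np.asarray(X)
        if self.weights_column is not None:
            sample_weight = X[:, self.weights_column]
        return super().fit(self._strip(X), y, sample_weight=sample_weight)

    def predict(self, X):
        return super().predict(self._strip(X))

    def predict_proba(self, X):
        return super().predict_proba(self._strip(X))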

In [33]:
weight_column_index = list(split_targets(significant_features, "label")[0].columns).index("weight")
sign_imbal_model = EasyEnsembleClassifier(
    n_estimators=6,
    base_estimator=WeightedAdaBoost(weights_column=weight_column_index),
    random_state=RANDOM_STATE,
    sampling_strategy="not minority",
    replacement=False
).fit(*split_targets(significant_features, "label"))

We build a report following the same structure as we had for the baseline model.

With only 2 classes the confusion matrix is simple. It shows that we still have not managed to overcome the imbalance.

Since the model is an ensemble of BDTs, as feature importance we show the median of the scores per feature across the ensemble members (see the sketch below).
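
A minimal sketch of how such a median importance could be aggregated (assuming the layout of imblearn's EasyEnsembleClassifier, where each fitted member is a pipeline ending in the boosted classifier; the notebook's report() may do this differently):

import numpy as np

member_importances = np.vstack([
    pipe.steps[-1][1].feature_importances_   # importances of one AdaBoost member
    for pipe in sign_imbal_model.estimators_
])
median_importance = np.median(member_importances, axis=0)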

In [34]:
save_figures(
    "easy-ensemble",
    report(sign_imbal_model, significant_feature_pipeline, significant_feature_pipeline.transform(d_test))
)
Sum of weights per label:
label    weight
0        0.000113
1        0.001258

AdaBoost can be seen as a specific representative of the class of additive models. XGBoost is more generic and allows optimizing any model from this class: it uses a gradient-descent-style approach to optimize the parameters of the model, while AdaBoost reweights the data sample at each step so as to concentrate on the most misclassified points, which effectively corresponds to minimizing an exponential loss function.
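
For reference (a standard textbook statement, not derived in this notebook), the additive model and the exponential loss that AdaBoost effectively minimizes can be written as

$$F(x) = \sum_{m=1}^{M} \alpha_m h_m(x), \qquad L_{\mathrm{exp}} = \sum_i w_i \, e^{-y_i F(x_i)}, \quad y_i \in \{-1, +1\},$$

whereas gradient boosting replaces the exponential loss with an arbitrary differentiable loss and fits each new weak learner to its negative gradient.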

More on AdaBoost vs XGBoost comparison:

In [35]:
xgboost_cv = GridSearchCV(XGBClassifier(random_state=RANDOM_STATE,
                                        objective="multi:softmax",
                                        num_class=2),
                          param_grid={
                            "learning_rate": [1E-10, 1E-6],
                            "max_depth": [10, 20] ,
                            "gamma": [0.1, 0.5, 1.2],
                            "reg_alpha": [1E-3, 1E-1],
                            "reg_lambda": [1E-3, 1E-1]
                          }, cv=3)
In [36]:
xgboost_cv.fit(*split_targets(significant_features, "label", drop=["weight"]),
               sample_weight=significant_features["weight"])
Out[36]:
GridSearchCV(cv=3, error_score=nan,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estim...
                                     reg_alpha=None, reg_lambda=None,
                                     scale_pos_weight=None, subsample=None,
                                     tree_method=None, validate_parameters=None,
                                     verbosity=None),
             iid='deprecated', n_jobs=None,
             param_grid={'gamma': [0.1, 0.5, 1.2],
                         'learning_rate': [1e-10, 1e-06], 'max_depth': [10, 20],
                         'reg_alpha': [0.001, 0.1],
                         'reg_lambda': [0.001, 0.1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [37]:
xgboost_cv.best_estimator_
Out[37]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0.1, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=1e-06, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_class=2, num_parallel_tree=1,
              objective='multi:softmax', random_state=23, reg_alpha=0.001,
              reg_lambda=0.001, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [38]:
save_figures(
    "xgboost",
    report(xgboost_cv.best_estimator_, significant_feature_pipeline, significant_feature_pipeline.transform(d_test), drop=["weight"])
)
/afs/cern.ch/user/a/ananiev/projects/venvs/sandbox/lib64/python3.6/site-packages/ipykernel_launcher.py:21: RuntimeWarning: divide by zero encountered in log10
/afs/cern.ch/user/a/ananiev/projects/venvs/sandbox/lib64/python3.6/site-packages/xgboost/sklearn.py:691: RuntimeWarning: invalid value encountered in true_divide
  return all_features / all_features.sum()

Conclusion

Within this toy project we managed to reproduce the Higgs mass peak, while discovering that ATLAS Open Data alone is not enough to reproduce the backgrounds contributing to the process. Moreover, we managed to set up an end-to-end pipeline for feature processing and learning. It is implemented in Python but, thanks to NumPy and its "family", is not less efficient than C++ code.

The feature processing pipeline now provides a basic set of simple features and can easily be extended to produce more sophisticated collections without loss of speed. We learned that jagged data is an essential feature of HEP analyses, and future attempts to train on particle data should account for this. Having this in mind, we nevertheless decided to follow the simplest approach for the sake of building a wider picture of the data analysis pipeline in ATLAS.

Finally, several models have been tested in order to classify the Higgs boson production mechanisms. All of them show results highly biased towards gluon-gluon fusion. This is most probably caused by the fact that the data is imbalanced in both respects: number of events and event weights. A class that is a minority by number of events can still constitute the majority in terms of weight.
