May I see a short example of cutting on data to prepare it for histogramming in uproot?

Question

I have am using Python 3.6.5 (or higher) and I have successfully installed 'numpy', 'uproot', and 'awkward'. I have a previously made *.root file with a jagged NTuple which contains quite a large number of branches. This is particle physics data and so one can think of the "rows" as individual collisions or "events" while the columns have the data structures. (And some of the columns might have a third dimension or more...I'll explain in a bit.)

In this case I have events where there are many "Jets" in the events and each "jet" has a lot of information about it.

jet_E, jet_pT, jet_eta, jet_phi, Numb (number of b tags), NLayer0 etc.

Each "event" could have any number of jets, but it is impossible for it to have zero jets in this case. Each of these jets will have this information stored, but all information from one "event" must be kept uncorrelated with any other "event". (If you already know Particle physics this part is probably already understood.)

I've been reading the uproot documentation and examples and I cannot see easily how one would, using only pythonic code like this, histogram the jet_pT but only for jets within the events where some OTHER jet variable is being cut upon. eta, for example.

How do I extract only the information from the *.root file about all the jet_pT for those jets with jet_eta>-1.0 and jet_eta<1.0 ? And suppose I only wanted to look at the first 3 jets in any event and ignore the rest, how would I put in place the cut described and only histogram the first 3 jets in any event that passed this cut?

The uproot documentation doesn't really make this very clear. Thanks!

score 0 · Answer 1 · answered Oct 22 '19 at 12:15

This is actually an awkward-array question, so I have tagged it as such. That also makes it easier to write an answer, because I can create simple artificial cases without having to put them in a ROOT file and read them back with uproot.

Here are some artificial jetpt and jeteta jagged arrays:

>>> import awkward
>>> jetpt = awkward.fromiter([[0.0, 1.1, 2.2], [3.3], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])
>>> jeteta = awkward.fromiter([[0.1, -1.2, 0.8], [1.2], [-0.8, 0.8], [0.2, -0.3, 0.9, 0.0]])
>>> jetpt, jeteta

(<JaggedArray [[0.0 1.1 2.2] [3.3] [4.4 5.5] [6.6 7.7 8.8 9.9]] at 0x7c2c9e9950b8>,
 <JaggedArray [[0.1 -1.2 0.8] [1.2] [-0.8 0.8] [0.2 -0.3 0.9 0.0]] at 0x7c2c9dfb9320>)

The key thing about these two jagged arrays is that they have the same number of inner array elements per inner array:

>>> jetpt.counts, jeteta.counts

(array([3, 1, 2, 4]),
 array([3, 1, 2, 4]))

This is still true when we perform mathematical operations on the arrays, such as inequality comparisons. (Note that we have to use bitwise operators for and/or/not because these are the only ones Numpy can overload. You also need parentheses because of the order of operations with bitwise operations.)

>>> (-1.0 < jeteta) & (jeteta < 1.0)

<JaggedArray [[True False True] [False] [True True] [True True True True]] at 0x7c2c9dfb9b38>

>>> ((-1.0 < jeteta) & (jeteta < 1.0)).counts

array([3, 1, 2, 4])

Since these boolean arrays have the same number of counts as jetpt, you can use them as an index on jetpt or any other jet variable. You can't use them as an index on muonpt, etc., because in general, there's a different number of muons per event than jets per event. (There's a lot more about this in the "03-columnar-data-analysis" notebook in this tutorial.)

>>> jetpt[(-1.0 < jeteta) & (jeteta < 1.0)]

<JaggedArray [[0.0 2.2] [] [4.4 5.5] [6.6 7.7 8.8 9.9]] at 0x7c2ca88f8550>

Your second question was about truncating to at most three jets. This is a simple slice, but one that applies to the second dimension. I'll apply it to the example above—it needs to be a separate square bracket, and often you'd assign the above to a variable if you think that makes it easier to read.

>>> jetpt[(-1.0 < jeteta) & (jeteta < 1.0)][:, :3]

<JaggedArray [[0.0 2.2] [] [4.4 5.5] [6.6 7.7 8.8]] at 0x7c2c9dfdb4a8>

See the awkward README for a lot more!

Wow....this is more complicated than I expected and I'm not entirely sure it scales to a *.root file with 8000 events (so that would be JaggedArrays with 8k first-indicies) and then also to quantities that add a third dimension to each jet. (there are 4 "layers" of a detector for each jet and each of these layers has some number of "hits", each with positional coordinates and a weight that are attached to the layer, and then the jet) Thanks for the resources. I definately needed an explanation like this. — Todd Huffman, Oct 22 '19 at 14:16
Performance-wise, it scales because all of the loops are executed in compiled Numpy, not interpreted Python, and the above works for jagged arrays of arbitrary depth (e.g. events containing jets containing tracks containing hits). Complexity/ease-of-use is in the eye of the beholder, though. This is how a typical Numpy/MATLAB analysis works—the only new piece is jaggedness. — Jim Pivarski, Oct 22 '19 at 18:44

May I see a short example of cutting on data to prepare it for histogramming in uproot?

1 Answers1