uproot: processing a TH2D using the uproot method .pandas()

Question

I am very new to uproot and Python, but hopefully catching up quickly. I am wondering why the method .pandas() is creating such a weird table from a TH2D histogram:

myhisto = file["angular_distr_el/ID3_mol_e0_valid/EN_gate/check_cthetaEE_x"]
type(myhisto)

outputs:

uproot.rootio.TH2D

Finally, myhisto.pandas() returns:

                                          count  variance
cos(theta)    electron energy [eV]
[-inf, -1.0)  [-inf, 10.0)                  0.0       0.0
              [10.0, 10.15)                 0.0       0.0
              [10.15, 10.3)                 0.0       0.0
              [10.3, 10.45)                 0.0       0.0
              [10.45, 10.6)                 0.0       0.0
...           ...                           ...       ...
[1.0, inf)    [24.4, 24.549999999999997)    0.0       0.0
              [24.549999999999997, 24.7)    0.0       0.0
              [24.7, 24.85)                 0.0       0.0
              [24.85, 25.0)                 0.0       0.0
              [25.0, inf)                   0.0       0.0

2244 rows × 2 columns

and myhisto.pandas().columns returns:

Index(['count', 'variance'], dtype='object')

Where can I find the documentation of the method .pandas() to understand what it is doing? Is there a way to reorganise myhisto in a DataFrame with the right columns?

This question is related to a previous issue partially solved by @JimPivarski: https://stackoverflow.com/questions/63738534/uproot-best-way-to-load-and-replot-a-th2-histogram-from-a-root-file-on-a-jupyt/63740917?noredirect=1#comment112739286_63740917 — giammi56, Sep 08 '20 at 09:11

giammi56 · Accepted Answer · 2020-09-08T12:48:38.683

After some fun but desperate browsing, I understand what type of object this is. It is a very clever way of creating a sorted MultiIndex DataFrame: each row is labelled by a pair of bin intervals, one per axis. Typing myhisto.pandas().index shows this directly:

MultiIndex([([-inf, -1.0),                [-inf, 10.0)),
            ([-inf, -1.0),               [10.0, 10.15)),
            ([-inf, -1.0),               [10.15, 10.3)),
            ([-inf, -1.0),               [10.3, 10.45)),
            ([-inf, -1.0),               [10.45, 10.6)),
            ([-inf, -1.0),               [10.6, 10.75)),
            ([-inf, -1.0),               [10.75, 10.9)),
            ([-inf, -1.0),               [10.9, 11.05)),
            ([-inf, -1.0),               [11.05, 11.2)),
            ([-inf, -1.0),               [11.2, 11.35)),
            ...
            (  [1.0, inf), [23.65, 23.799999999999997)),
            (  [1.0, inf), [23.799999999999997, 23.95)),
            (  [1.0, inf),               [23.95, 24.1)),
            (  [1.0, inf),               [24.1, 24.25)),
            (  [1.0, inf),               [24.25, 24.4)),
            (  [1.0, inf),  [24.4, 24.549999999999997)),
            (  [1.0, inf),  [24.549999999999997, 24.7)),
            (  [1.0, inf),               [24.7, 24.85)),
            (  [1.0, inf),               [24.85, 25.0)),
            (  [1.0, inf),                 [25.0, inf))],
           names=['cos(theta)', 'electron energy [eV]'], length=2244)

The solution is to unstack the DataFrame or to create a pivot table from it. For this specific object a pivot table is more convenient, because the original DataFrame has both count and variance as columns, and a pivot table can select just one of them. As an example:

myhisto.pandas().unstack()

count   ... variance
electron energy [eV]    [-inf, 10.0)    [10.0, 10.15)   [10.15, 10.3)   [10.3, 10.45)   [10.45, 10.6)   [10.6, 10.75)   [10.75, 10.9)   [10.9, 11.05)   [11.05, 11.2)   [11.2, 11.35)   ... [23.65, 23.799999999999997) [23.799999999999997, 23.95) [23.95, 24.1)   [24.1, 24.25)   [24.25, 24.4)   [24.4, 24.549999999999997)  [24.549999999999997, 24.7)  [24.7, 24.85)   [24.85, 25.0)   [25.0, inf)
cos(theta)                                                                                  
[-inf, -1.0)    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[-1.0, -0.9)    0.0 1.0 1.0 0.0 0.0 2.0 0.0 2.0 0.0 1.0 ... 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
[-0.9, -0.8)    0.0 0.0 3.0 3.0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
[-0.8, -0.7)    0.0 0.0 1.0 2.0 0.0 1.0 1.0 2.0 1.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[-0.7, -0.6)    0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0
[-0.6, -0.5)    0.0 1.0 1.0 1.0 0.0 0.0 2.0 1.0 0.0 3.0 ... 0.0 1.0 0.0 1.0 1.0 

**22 rows × 204 columns**

vs.

pipanda = myhisto.pandas()
pivot_pipanda = pipanda.pivot_table(values="count", index="cos(theta)", columns="electron energy [eV]")

electron energy [eV]    [-inf, 10.0)    [10.0, 10.15)   [10.15, 10.3)   [10.3, 10.45)   [10.45, 10.6)   [10.6, 10.75)   [10.75, 10.9)   [10.9, 11.05)   [11.05, 11.2)   [11.2, 11.35)   ... [23.65, 23.799999999999997) [23.799999999999997, 23.95) [23.95, 24.1)   [24.1, 24.25)   [24.25, 24.4)   [24.4, 24.549999999999997)  [24.549999999999997, 24.7)  [24.7, 24.85)   [24.85, 25.0)   [25.0, inf)
cos(theta)                                                                                  
[-inf, -1.0)    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[-1.0, -0.9)    0.0 1.0 1.0 0.0 0.0 2.0 0.0 2.0 0.0 1.0 ... 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
[-0.9, -0.8)    0.0 0.0 3.0 3.0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
[-0.8, -0.7)    0.0 0.0 1.0 2.0 0.0 1.0 1.0 2.0 1.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[-0.7, -0.6)    0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0
[-0.6, -0.5)    0.0 1.0 1.0 1.0 0.0 0.0 2.0 1.0 0.0 3.0 ... 0.0 1.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
[-0.5, -0.3999999999999999) 0.0 0.0 2.0 0.0 1.0 1.0 3.0 2.0 3.0 1.0 ... 3.0 0.0 0.0 0.0 0.0 2.0 0.0 1.0 1.0 0.0

and from here the standard methods of pandas are available!

(To play with slicing techniques like loc[] and iloc[]: https://www.youtube.com/watch?v=tcRGa2soc-c)
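For readers without the ROOT file at hand, the reshaping can be tried on a synthetic table. This is only a sketch: the bin edges, the axis names "x" and "y", and the counts are made up for illustration, and the DataFrame is built directly with pandas rather than through uproot. Selecting the "count" column before unstacking is used here as a simple way to get a count-only table comparable to the pivot table above.

```python
import numpy as np
import pandas as pd

# Hypothetical bin edges, including the -inf/+inf underflow and overflow bins.
x_edges = [-np.inf, -1.0, 0.0, 1.0, np.inf]           # 4 bins on "x"
y_edges = [-np.inf, 10.0, 15.0, 20.0, 25.0, np.inf]   # 5 bins on "y"

x_bins = pd.IntervalIndex.from_breaks(x_edges, closed="left")
y_bins = pd.IntervalIndex.from_breaks(y_edges, closed="left")

# One row per (x bin, y bin) pair, the same long layout shown in the question.
index = pd.MultiIndex.from_product([x_bins, y_bins], names=["x", "y"])
counts = np.arange(len(index), dtype=float)            # made-up bin contents
df = pd.DataFrame({"count": counts, "variance": counts}, index=index)

# Unstacking the whole frame keeps both value columns side by side.
wide_all = df.unstack()             # 4 rows, 2 * 5 = 10 columns

# Selecting one column first gives a plain 2D table of counts.
wide_count = df["count"].unstack()  # 4 rows, 5 columns
```

Rows of wide_count are the "x" bins and columns are the "y" bins, so wide_count.iloc[i, j] is the content of bin (i, j), underflow and overflow included.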

You figured this out before I got a chance to explain it. The `TH*.pandas()` function is underdocumented, but once it's in Pandas form, there's a lot of information online about working with Pandas DataFrames. The thinking is that DataFrames don't fit particle physics event data well because a variable number of particles isn't rectilinear like a table. However, aggregated data (histograms) _do_ fit Pandas's data model well. Since Pandas has an Interval Index type, that's a natural way to represent a histogram. — Jim Pivarski, Sep 08 '20 at 13:20
It even merges histograms with different binnings well, though it assumes that missing bins should be filled with NaN, rather than 0 (which you can fill with `fillna`). You can rebin with Pandas's resampling, though you have to explicitly tell it that you want the aggregation function to be `sum`. The problem is that histograms have a natural aggregation function, sum (whose identity is 0), but Pandas doesn't know to default to this. (Understandably: it's more general than that.) Note that the error column is "variance" so that they add like the "counts". — Jim Pivarski, Sep 08 '20 at 13:26
Thank you for your precious suggestions! There is still a problem: now my index (as well as the columns) is of the type IntervalIndex (series of tuples?). This is not a supported format for plt.plot! How would you proceed to plot the TH2D? — giammi56, Sep 08 '20 at 16:27
@JimPivarski I tried the approach suggested here ( https://github.com/pandas-dev/pandas/issues/33560 ) using plt.pcolormesh(), but both the first and the last values of the IntervalIndex index and columns are ±inf. How can I deal with this? Clearly, reducing the IntervalIndex by slicing [1:] and [:-1] results in a mismatch with the dimensions of the C matrix (i.e. the counts). A strategy could be to replace the ±inf values with finite ones, using the distance between the .right values of the same index. I will try to implement this, but could you suggest your approach? Thank you! — giammi56, Sep 08 '20 at 21:43
You could slice the edges by `edges[1:-1]` and the 2d bins by `bins[1:-1, 1:-1]`. — Jim Pivarski, Sep 09 '20 at 14:34
@JimPivarski Doing so, the sizes of the index and columns no longer match the C matrix. I need to replace the bin. Any idea? — giammi56, Sep 09 '20 at 14:48
The question is posed in a better fashion here: https://stackoverflow.com/questions/63814507/how-to-modify-inf-into-an-index — giammi56, Sep 09 '20 at 15:09
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221221/discussion-between-giammi-and-jim-pivarski). — giammi56, Sep 09 '20 at 15:10
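The slicing suggested in the comments can be sketched on a small synthetic table. The key point is that the underflow and overflow bins have to be dropped from the counts and from the edges together, otherwise the shapes no longer match. Everything below (bin edges, values, variable names) is hypothetical, and the table stands in for a count-only pivot of the histogram.

```python
import numpy as np
import pandas as pd

# Hypothetical wide count table: rows are x bins, columns are y bins,
# with -inf/+inf underflow and overflow bins on both axes.
x_bins = pd.IntervalIndex.from_breaks([-np.inf, -1.0, 0.0, 1.0, np.inf], closed="left")
y_bins = pd.IntervalIndex.from_breaks([-np.inf, 10.0, 15.0, 20.0, 25.0, np.inf], closed="left")
wide = pd.DataFrame(np.arange(20.0).reshape(4, 5), index=x_bins, columns=y_bins)

# Drop the first and last row and column, i.e. the underflow/overflow bins.
inner = wide.iloc[1:-1, 1:-1]

# N finite bins need N + 1 finite edges: every left edge plus the last right edge.
x_edges = np.append(inner.index.left, inner.index.right[-1])
y_edges = np.append(inner.columns.left, inner.columns.right[-1])

# With x on the horizontal axis, pcolormesh(X, Y, C) expects C to have
# shape (len(Y) - 1, len(X) - 1), hence the transpose.
C = inner.to_numpy().T
```

With matplotlib imported as plt, plt.pcolormesh(x_edges, y_edges, C) should then accept these arrays, since all edges are finite and C has one fewer entry than the edges along each axis.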
