
I am trying to convert my flat ROOT ntuples into pandas DataFrames (via NumPy arrays). Currently I am using root_numpy, and I would like to use uproot instead to avoid any ROOT dependencies.

I have a list named vars_to_save containing the string names of the 20 variables to save (out of a total of 104 variables in the tree).

In root_numpy I use:

import ROOT
import pandas as pd
from root_numpy import tree2array

rootFile = ROOT.TFile(f)
intree = rootFile.Get(tree_name)
arr = tree2array(intree, branches=vars_to_save)
df_root_numpy = pd.DataFrame(arr)

This takes ~2 seconds (for 491176 events and the 20 variables in vars_to_save).

In uproot I have tried:

import uproot
import pandas as pd

# attempt 1
tree = uproot.open(f)[tree_name]
df_uproot = tree.pandas.df(vars_to_save)

# attempt 2
tree = uproot.open(f)[tree_name]
df_uproot = tree.arrays(vars_to_save, outputtype=pd.DataFrame)

# attempt 3
tree = uproot.open(f)[tree_name]
arr = tree.arrays(vars_to_save)
df_uproot = pd.DataFrame(arr)[vars_to_save]

Each of these takes ~45 seconds (around 20 times slower). In attempt 3, I notice that the tree.arrays() step is the slowest, taking around 40 seconds.
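As a side note, attempt 3's final step can be exercised in isolation with stand-in data (hypothetical branch names; in uproot 3, tree.arrays() returns a dict keyed by bytestrings by default, so string-based column selection may need the keys decoded first):

```python
import numpy as np
import pandas as pd

# Stand-in for the dict returned by tree.arrays(vars_to_save) in uproot 3:
# keys are bytestring branch names by default.
arr = {b"pt": np.arange(5, dtype=np.float32), b"eta": np.zeros(5)}

# Decode the keys so selecting columns by string name works.
decoded = {k.decode("utf-8"): v for k, v in arr.items()}
df = pd.DataFrame(decoded)[["pt", "eta"]]
```

This isolates the DataFrame-construction cost from the file-reading cost, which is useful when timing the two steps separately.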

Is there a way in uproot to speed up this operation?

Dnoel
  • Attempt 3 is good because there are subtleties in constructing DataFrames, but you've tried that. If these are really pure numerical TBranches, then it's contrary to our tests, which found Uproot to exceed root_numpy by factors of several. One guess as to what might be going on is that these are small TBaskets (< 10 kB), where Uproot would have Python overhead in navigating from TBasket to TBasket, whereas root_numpy (actually, ROOT) navigates them in C++. Another guess is that the file is remote and Uproot 3 reads too much from remote files (being fixed in Uproot 4). – Jim Pivarski Jun 14 '20 at 15:45
  • Short story: you're not doing anything wrong, the difference you see is dramatic, but it can't be diagnosed without more information. – Jim Pivarski Jun 14 '20 at 15:45
  • Hi Jim, thanks for such a quick reply! I have had a look at the datatypes of the variables I'm saving and see: '>f8', '>f4', '>f4', '>f4', '>f4', '>f4', '>f4', '>f4', '>i4', '>i4', '>f4', '>f4', '>f4', '>f4', '>f4', '>f4', 'bool', '>f4', '>f4', '>f8'. So I tried saving all the variables except the boolean, but it still takes a similar time (~40 seconds). I have another ntuple with only int/float branches for which the uproot methods above are faster than root_numpy, as expected – so I am inclined to think it is something to do with these boolean branches. – Dnoel Jun 15 '20 at 17:57
  • Would having booleans in the tree which we don’t save affect the performance? If not I will investigate the Tbasket sizes. Cheers – Dnoel Jun 15 '20 at 17:57
  • ROOT's algorithm for determining TBasket sizes is complex; the addition of a TBranch can cause it to break the data into smaller TBaskets to get them to line up on entry boundaries. Instead of trying different configurations looking only at the reading speed, look at the TBasket sizes: `[branch.numbaskets for branch in tree]` (if `numbaskets` is large, then they must all be small; alternatively, look at `branch.basket_compressedbytes(i)` for each basket `i`). – Jim Pivarski Jun 16 '20 at 16:14
  • I checked the numbaskets and yes the slow file in uproot (nevents = 490 000) has 13 700 baskets, whereas the faster file in uproot (nevents = 480 000) has 64 baskets! As expected each of the 13 700 baskets are very small too. Is there anything on the uproot side that can be done or is this a problem with the ROOT file itself? – Dnoel Jun 18 '20 at 11:56
  • It's a problem with the ROOT file. The Achilles' heel of Uproot is that file navigation is done in Python and Python is slow. "Good" files with large batches of columnar data are fine because a whole batch can be read with a single NumPy call and NumPy is fast. We've thought about writing a C extension to do file navigation, but that would undermine Uproot's portability, and after all, C++ ROOT already covers the case of "compiled/fast but a little less portable." – Jim Pivarski Jun 19 '20 at 12:28
  • Incidentally, a file with a large number of small TBaskets is going to be a performance problem for any system—even compiled code will be slowed down by unvectorizable loops, CPU cache misses, and indirection for small, randomly distributed batches of data, compared with large, contiguous batches. Your "slow" file would be somewhat slower in ROOT, too, but it's extra-slow in Uproot. If you have control over the file-writing process, try to reconfigure ROOT's "auto flush" to flush (write a TBasket) less frequently. Otherwise, you might want to consider re-basketing the files (in ROOT). – Jim Pivarski Jun 19 '20 at 12:32
  • Oh, I forgot that you're comparing to root_numpy, but that's equivalent to ROOT in this context because root_numpy compiles against a version of ROOT. You could do the re-basketizing in root_numpy (i.e. read the many-small-baskets file in root_numpy as arrays, then write the whole arrays to a new file: the size of the output baskets would be determined by root_numpy). Your analysis on the re-basketized files would be faster whether you use root_numpy or Uproot, but it would be faster-er in Uproot. – Jim Pivarski Jun 19 '20 at 12:36
  • Thanks for the comprehensive answers, I will look into re-basketizing the data in ROOT / root_numpy. Cheers! – Dnoel Jun 22 '20 at 11:18
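The per-basket overhead described in the comments can be illustrated without ROOT at all. A rough sketch (pure NumPy, with the basket counts borrowed from the two files above): rebuilding one array from ~13 700 tiny chunks incurs far more per-chunk Python overhead than rebuilding it from 64 large chunks, even though the total payload is identical.

```python
import time
import numpy as np

n_events = 490_000
data = np.random.rand(n_events)

# "Good" file: a few large baskets; "slow" file: ~13 700 tiny baskets.
large_baskets = np.array_split(data, 64)
small_baskets = np.array_split(data, 13_700)

def read(baskets):
    # Mimic per-basket navigation: touch each basket individually,
    # then stitch the pieces back into one contiguous array.
    return np.concatenate([b.copy() for b in baskets])

t0 = time.perf_counter(); a = read(large_baskets); t_large = time.perf_counter() - t0
t0 = time.perf_counter(); b = read(small_baskets); t_small = time.perf_counter() - t0
print(f"64 baskets: {t_large:.4f}s; 13700 baskets: {t_small:.4f}s")
```

On a typical machine the small-basket path is noticeably slower; this is the same effect, magnified, that uproot hits when it navigates each TBasket in Python.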

0 Answers