
Working with Geant4 simulations that output .root files, I was happy to discover the uproot package.

Believing that dataframes are the best choice for my specific analysis task, I'm using uproot's pandas.df() method to read the contents of a TTree into such a dataframe.

Unfortunately, this turned out to be a bottleneck. While the code handles purely numeric input well, handling strings seems to pose a serious problem. The file is quite big, and the resulting frame has 2406703 rows.

While this code (Egamma and z_eu both numeric):

df = uproot.open('rootFile.root')['seco_tuple;1'].pandas.df(['Egamma','z_eu'])

needs 430 ms on average, adding just one column with strings:

df = uproot.open('rootFile.root')['seco_tuple;1'].pandas.df(['Name','Egamma','z_eu'])

increases the time to almost 3.5 s. Having a second column with strings doubles the time again. What I also tried was reading the data into a dictionary first and then passing it to a dataframe: reading the data was fairly quick, but passing it into a dataframe was then again quite slow (roughly along the lines of the sketch below).
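
For illustration, that dictionary attempt looked roughly like this (a sketch with the same branch names as above; the actual code may have differed in details):

import uproot
import pandas as pd

tree = uproot.open('rootFile.root')['seco_tuple;1']

# Reading the branches into plain arrays is fairly quick ...
data = {name: tree.array(name) for name in ['Name', 'Egamma', 'z_eu']}

# ... but building the DataFrame from them is slow again,
# presumably because the string branch becomes a column of Python objects.
df = pd.DataFrame(data)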

Since the strings obviously make the code consume far more resources, I was wondering whether strings in general are a problem for dataframes, or whether the specific type of string here makes the difference?

I hope to get some further insight here, and I can provide the .root file as well as an MWE in case this is required.

Thanks in advance!

Number42

1 Answer


Part of the problem is that uproot, written in pure Python, is sharply divided in performance between numeric work, which NumPy handles very quickly, and anything involving Python objects, such as strings. In some cases, like this one, we have the option of treating strings as a numeric array, and that can help performance a lot.

The other part is Pandas itself. NumPy's strings are inefficient in the sense that they have to be padded to a common length—the length of the longest string in the dataset—because NumPy can only deal with rectilinear arrays. So Pandas opts for a different inefficiency: it takes strings as Python objects, via a NumPy array whose dtype is object (i.e. pointers to Python objects, not raw numerical data).

>>> import pandas, numpy
>>> df = pandas.DataFrame({"column": ["one", "two", "three", "four", "five"]})
>>> df
  column
0    one
1    two
2  three
3   four
4   five
>>> df.values
array([['one'],
       ['two'],
       ['three'],
       ['four'],
       ['five']], dtype=object)

When strings are just labels, Pandas has a "categorical" dtype (not part of NumPy!) that replaces each distinct string with an integer so that it's really an integer array with metadata.

>>> df["column"].astype("category")
0      one
1      two
2    three
3     four
4     five
Name: column, dtype: category
Categories (5, object): [five, four, one, three, two]

If you have a small number of distinct strings compared to the total number of strings (not the case above), then this is faster and uses less memory. If every string is unique, this only bloats the data.
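
As a rough illustration (made-up labels; the exact numbers depend on string lengths and your Pandas version):

import pandas

# Three distinct labels repeated over millions of rows: the "category" dtype
# stores one small integer code per row plus a lookup table of the labels.
labels = pandas.Series(["electron", "gamma", "neutron"] * 1000000)

print(labels.memory_usage(deep=True))                     # object dtype: roughly a couple hundred MB
print(labels.astype("category").memory_usage(deep=True))  # roughly one byte per row plus the label table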

Perhaps uproot's DataFrame conversion should read some string-valued branches into a "categorical" dtype. That would need to be explicitly requested by the user as an argument, since it's not always helpful. Such a thing would go in the uproot._connect._pandas.futures2df function—I'd accept a PR for this if anybody is willing to contribute one.
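
Until something like that exists, a user-side workaround is to convert after reading. This is only a sketch using the branch names from the question, and it just shrinks the in-memory representation afterward; it doesn't avoid the slow string read itself:

import uproot

tree = uproot.open('rootFile.root')['seco_tuple;1']
df = tree.pandas.df(['Name', 'Egamma', 'z_eu'])

# Replace the column of Python string objects with integer codes plus a label table.
# Only worthwhile if 'Name' has few distinct values relative to the number of rows.
df['Name'] = df['Name'].astype('category')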

Jim Pivarski
  • Hi, thanks a lot for the input. It seems that handling this data set with `DataFrame` doesn't provide an efficient solution. Without changing the way in which `uproot` reads the data into a `DataFrame`, there seems to be no gain in speed or memory usage, at least to my (admittedly limited) knowledge ... – Number42 Nov 20 '19 at 17:06