Working with Geant4 simulations that output .root files, I was happy to discover the uproot package.
Believing that dataframes are the best choice for my specific analysis task, I'm using uproot.pandas.df() to read the contents of a TTree into such a dataframe.
Unfortunately, this turned out to be a bottleneck. While the code deals well with all-numeric input, handling strings seems to pose a serious problem. The file is quite big, with the resulting frame having 2406703 rows.
While this code (Egamma and z_eu both numeric):
df = uproot.open('rootFile.root')['seco_tuple;1'].pandas.df(['Egamma', 'z_eu'])
takes 430 ms on average, including just one column with strings:
df = uproot.open('rootFile.root')['seco_tuple;1'].pandas.df(['Name', 'Egamma', 'z_eu'])
increases the time to almost 3.5 s. Adding a second column with strings doubles the time. I also tried reading the data into a dictionary and then passing it to a dataframe. Reading the data was fairly quick, but converting it into a dataframe was again quite slow.
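To make the dictionary route concrete without the .root file, here is a rough sketch of how the conversion step can be timed. The synthetic arrays only stand in for what is read from the TTree: the particle names, the helper names make_columns and time_conversion, and the assumption that the strings arrive as Python bytes objects in an object array are all mine, not taken from the real data.

```python
import time

import numpy as np
import pandas as pd


def make_columns(n_rows, with_strings, seed=0):
    """Build a dict of columns; a synthetic stand-in for the TTree branches."""
    rng = np.random.default_rng(seed)
    columns = {
        "Egamma": rng.random(n_rows),
        "z_eu": rng.random(n_rows),
    }
    if with_strings:
        # Hypothetical particle names, stored as bytes objects (assumption).
        names = np.array([b"gamma", b"e-", b"neutron"], dtype=object)
        columns["Name"] = names[rng.integers(0, len(names), n_rows)]
    return columns


def time_conversion(columns):
    """Time only the dict -> DataFrame step, not the creation of the arrays."""
    start = time.perf_counter()
    frame = pd.DataFrame(columns)
    return frame, time.perf_counter() - start


df_num, t_num = time_conversion(make_columns(100_000, with_strings=False))
df_str, t_str = time_conversion(make_columns(100_000, with_strings=True))

print(f"numeric only: {t_num:.4f} s, with strings: {t_str:.4f} s")
print(df_str.dtypes)
```

The Name column ends up with dtype object here, which is what makes me suspect that each string is handled as an individual Python object rather than as part of a contiguous numeric buffer.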
Since the strings obviously cause the code to take much more resources, I was wondering whether strings in general are a problem for dataframes, or whether the specific type of string here might make a difference.
I hope to get some further insight here and can try to provide the .root file as well as an MWE, in case this is required.
Thanks in advance!