I wanted to try using uproot to read a number of ROOT files with flat ROOT NTuples into a dask DataFrame: 214 files, 500 kB each, with about 8000 rows and 16 columns/variables in each. They easily fit into a pandas DataFrame in memory, but I am trying to learn dask (and uproot; I have only worked with root_pandas previously), since I expect larger datasets in the future.
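For scale, reading everything with plain pandas looks roughly like this (a minimal sketch, assuming uproot 3's tree.pandas.df API; the paths and tree name are placeholders):

import pandas as pd
import uproot

paths = ["ntuple_000.root", "ntuple_001.root"]  # placeholder paths
# Read each file's tree into pandas and concatenate everything in memory.
df = pd.concat(uproot.open(p)["mytree"].pandas.df() for p in paths)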
So I thought that uproot.daskframes(list_of_paths, flatten=True) would be the tool to read the files into a dask DataFrame. Creating the frame works nicely, but computing it fails with the following Too many open files error: https://pastebin.com/mfHgB16Q
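One workaround I considered for that particular error, independent of uproot, is raising the per-process file-descriptor limit (Unix only; this treats the symptom rather than the cause):

import resource

# Raise the soft open-file limit up to the hard limit (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))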
When I limit the number of files to, for example, 100, the computation works but is slow (30 seconds); on a few files it is no problem. When I use 100 files and increase the basketcache (e.g. to 100 MB) to speed things up, I get a RecursionError: https://pastebin.com/xTHa1Wav
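The only stopgap I could think of there is raising Python's recursion limit; I am not sure it addresses the underlying cause:

import sys

# The default limit is usually 1000; raising it may paper over deep
# recursion inside the library, at the risk of a crash if it goes deeper.
sys.setrecursionlimit(10000)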
My own solution is to create normal pandas DataFrames with uproot, delay their creation, and use dask to concatenate them. This works well for me and results in a faster computation than uproot.daskframes for large numbers of files.
import uproot
from dask import delayed
import dask.dataframe as dd


def daskframe_from_rootfiles(path_list, treepath, branches=None):
    # Delayed task: open one file, grab the tree, and convert it to pandas.
    @delayed
    def get_df(file, treepath=None, branches=None):
        tree = uproot.open(file)[treepath]
        return tree.pandas.df(branches=branches)

    # One delayed pandas DataFrame per file; dask stitches them together
    # lazily into a single dask DataFrame.
    dfs = [get_df(path, treepath, branches=branches) for path in path_list]
    daskframe = dd.from_delayed(dfs)
    return daskframe
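I call it like this (the paths and tree name are placeholders):

paths = ["ntuple_000.root", "ntuple_001.root"]  # placeholder paths
df = daskframe_from_rootfiles(paths, treepath="mytree")
result = df.compute()  # this is where the files are actually read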
The advantage of delaying the dataframe creation is that I can use dask to parallelize it.
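If the open-file limit keeps biting, a variation I sketched (untested at scale; the helper name and files_per_partition are my own) is to group several files into each delayed task, so fewer files should be open at once, assuming the handles are released when the per-file DataFrames go out of scope:

import pandas as pd
import uproot
from dask import delayed
import dask.dataframe as dd


def daskframe_from_rootfiles_chunked(path_list, treepath, branches=None,
                                     files_per_partition=10):
    # Each delayed task reads a small batch of files sequentially and
    # returns a single pandas DataFrame for that batch.
    @delayed
    def get_chunk_df(paths):
        frames = [uproot.open(p)[treepath].pandas.df(branches=branches)
                  for p in paths]
        # ignore_index is fine for flat NTuples with a trivial entry index.
        return pd.concat(frames, ignore_index=True)

    chunks = [path_list[i:i + files_per_partition]
              for i in range(0, len(path_list), files_per_partition)]
    return dd.from_delayed([get_chunk_df(chunk) for chunk in chunks])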
But I feel like there should be some canonical way, that I am probably missing something, and that there may be other options I should pass to the daskframes function, or maybe some other function entirely that I should use. Can you help me with any ideas or best practices?