I wanted to try using uproot to read a number of ROOT files with flat ROOT NTuples into a dask dataframe: 214 files, 500 kB each, with about 8000 rows and 16 columns/variables in each. They easily fit in a pandas data frame in memory, but I am trying to learn dask (and uproot; I only worked with root_pandas previously) since I expect larger datasets in the future.

So I thought that uproot.daskframe(list_of_paths, flatten=True) would be the tool to read the files into a dask dataframe. Creating the frame works nicely, but computing it results in the following Too many open files error: https://pastebin.com/mfHgB16Q . When I limit the input to, for example, 100 files, the computation works but is slow (30 seconds); with only a few files there is no problem. When I use 100 files and increase the basketcache (e.g. to 100 MB) to speed things up, I get a RecursionError: https://pastebin.com/xTHa1Wav
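
A Too many open files error usually points at the per-process limit on open file descriptors, so one quick check is to compare that limit with the number of files. The following is only a minimal sketch, assuming a Unix-like system (the `resource` module does not exist on Windows). The helper name `may_exceed_limit` and the safety margin of 16 descriptors are made up for illustration, and nothing here is specific to uproot.

```python
import resource  # standard library, Unix-only

N_FILES = 214  # number of ROOT files in this dataset

# Soft limit: what the process may open right now.
# Hard limit: the ceiling the soft limit can be raised to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)


def may_exceed_limit(n_files, soft_limit, reserved=16):
    """Return True if opening n_files at once could hit the soft limit.

    `reserved` is an arbitrary allowance for descriptors that are already
    in use (stdin, stdout, stderr, sockets, shared libraries, ...).
    """
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return n_files + reserved >= soft_limit


print(f"soft limit: {soft}, hard limit: {hard}")
print("risk of 'Too many open files':", may_exceed_limit(N_FILES, soft))
```

If the check reports a risk, either the limit has to be raised or fewer files may be held open at the same time.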

My own solution was to create normal pandas data frames with uproot, delay their creation, and use dask to concatenate them. This works well for me and results in a faster computation than uproot.daskframe for large numbers of files.

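import uproot
from dask import delayed
import dask.dataframe as dd

def daskframe_from_rootfiles(path_list, treepath, branches=None):
    """Build a dask DataFrame with one partition per ROOT file."""
    @delayed
    def get_df(file, treepath=None, branches=None):
        # Runs lazily: the file is only opened when dask computes this task.
        tree = uproot.open(file)[treepath]
        return tree.pandas.df(branches=branches)

    # One delayed pandas DataFrame per file; nothing is read at this point.
    dfs = [get_df(path, treepath, branches=branches) for path in path_list]
    daskframe = dd.from_delayed(dfs)
    return daskframe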

The advantage of delaying the dataframe creation is that I can use dask to parallelize it.

But I feel like there should be a canonical way to do this, and I am probably missing something. Maybe there are other options that I should pass to the daskframe function, or maybe I should use a different function entirely. Can you help me with any ideas or best practices?

Michael E.
  • I think you might be discovering the best practices. Daskframes in uproot were tested for correctness, but not pushed to any extremes; an actual use case (like yours) is the best way to do that. I'm glad you found a way to get what you want. In my opinion, what you learn from your experiences should _become_ the Right Way To Do It. If you have time to contribute, maybe you could correct the inefficiencies of the daskframes in uproot or in some other way advertise your better way of doing it. (This StackOverflow post is a start; some people will find it.) – Jim Pivarski Feb 12 '20 at 13:46
  • I suspect the difference in efficiency comes from the fact that the daskframe function builds a dask array for each column in the output of `uproot.lazyarrays` separately and then uses dask to concatenate the dask arrays column-wise. This means that the dask graph is chunked in a very different way compared to my approach of merging whole dataframes (horizontally vs. vertically), which results in a different computation graph. That might make sense for some problems. Thanks for the quick reply and keep up your excellent work, I am very thankful for it. – Michael E. Feb 12 '20 at 14:53
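
To make the chunking difference concrete, here is a rough count of read tasks for the dataset in the question. It is only a sketch under a simplifying assumption: column-wise assembly needs one read task per (file, column) pair, while row-wise assembly needs one per file. The graph that uproot actually builds may look different, for example depending on basket sizes, and the helper name `count_read_tasks` is made up for this illustration.

```python
def count_read_tasks(n_files, n_columns, column_wise):
    """Estimate the number of leaf read tasks in the dask graph.

    Assumption (not taken from uproot's source): column-wise assembly
    reads every column of every file in its own task; row-wise assembly
    reads each file once as a whole DataFrame.
    """
    return n_files * n_columns if column_wise else n_files


n_files, n_columns = 214, 16  # numbers from the question

print("column-wise:", count_read_tasks(n_files, n_columns, column_wise=True))
print("row-wise:   ", count_read_tasks(n_files, n_columns, column_wise=False))
```

Under this assumption the column-wise graph has 16 times as many read tasks, each of which may open its file again.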

2 Answers

    @delayed
    def get_df(file, treepath=None, branches=None):
        tree = uproot.open(file)[treepath]
        return tree.pandas.df(branches=branches)

My guess is that this function is leaving an open file handle. Maybe there is some way to close the file after opening?

    @delayed
    def get_df(filename, treepath=None, branches=None):
        file = uproot.open(filename)
        tree = file[treepath]
        df = tree.pandas.df(branches=branches)
        file.close()  # does something like this exist?
        return df
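
In uproot 4 and later, the object returned by `uproot.open` can be used as a context manager, which closes the file when the block exits. The sketch below relies on that newer API, including `tree.arrays(..., library="pd")` in place of the `tree.pandas.df` call above, so it will not work unchanged with the uproot version used in the question. The `opener` parameter is a hypothetical hook, added only so the function can be tried without a real ROOT file.

```python
def get_df(filename, treepath, branches=None, opener=None):
    """Read one tree into a pandas DataFrame and close the file afterwards."""
    if opener is None:
        import uproot  # assumption: uproot >= 4 is installed
        opener = uproot.open

    # Leaving the with-block closes the file, even if reading raises.
    with opener(filename) as file:
        tree = file[treepath]
        # branches=None reads all branches of the tree.
        return tree.arrays(branches, library="pd")
```

Wrapping it as `dask.delayed(get_df)` gives the same kind of lazy task as the decorated function in the question.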
MRocklin

Jim Pivarski's comment confirms that my approach is okay and that I am not doing something completely wrong. Since he is the developer, I don't expect much more sophisticated answers, so I'll mark this as answered.

Edit: I can't mark my own answer as the solution until two days have passed, so I'll wait until then or until somebody else posts an answer.

Michael E.