
I am trying to parallelize a (somewhat) simple script using dask. I originally read my ROOT file using uproot into a pandas.DataFrame in this way:

import uproot

file_name = "ntuple.root"
tree_name = "Events"
branches = ["Trigger", "nTracks", "mass"]
df = uproot.open(file_name)[tree_name].pandas.df(branches)
df.dtypes.to_dict()
# {'Trigger': dtype('bool'),
#  'nTracks': dtype('uint64'),
#  'mass': dtype('float64')}

When I do this, everything works fine: the dtype of each branch is recognized correctly. In the example above, the "Trigger" branch is a boolean, the "nTracks" branch is an integer, and the "mass" branch is a float.

However, when I read the same file using uproot.daskframe, I get back a Dask DataFrame in which all three branches are assigned float64:

df = uproot.daskframe(file_name, tree_name, branches)
df.dtypes.to_dict()
# {'Trigger': dtype('float64'),
#  'nTracks': dtype('float64'),
#  'mass': dtype('float64')}

This breaks the script later on, because I need the "Trigger" branch to be interpreted as a plain boolean in order to cut on it.

Ideally I would want it to automatically recognize the right dtype for each column.

Alternatively, setting them manually would also be fine for me. I could find out the right dtypes by reading one event the old way and reusing them. According to the documentation, I should be able to pass a dictionary that maps each branch to a uproot.interp.interp.Interpretation, but it is unclear to me how exactly that works.
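To make the manual route concrete, here is a minimal sketch of what I have in mind. It only uses pandas, with small made-up frames standing in for the real data: `sample` plays the role of a one-event read done the old way, and `all_float` plays the role of the frame that comes back with every column as float64. Both names and all values are hypothetical. I assume the Dask DataFrame's astype accepts the same column-to-dtype mapping, but I have not verified that against dask 2.6.0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a one-event sample read "the old way"
# (with the .pandas.df call shown at the top of the question).
sample = pd.DataFrame({
    "Trigger": np.array([True], dtype="bool"),
    "nTracks": np.array([3], dtype="uint64"),
    "mass": np.array([91.2], dtype="float64"),
})

# Column name -> dtype mapping taken from the correctly typed sample.
target_dtypes = sample.dtypes.to_dict()

# Hypothetical stand-in for the frame whose columns all came back as float64.
all_float = pd.DataFrame({
    "Trigger": [1.0, 0.0, 1.0],
    "nTracks": [3.0, 5.0, 2.0],
    "mass": [91.2, 3.1, 125.0],
})

# Cast every column back to the dtype seen in the sample.
fixed = all_float.astype(target_dtypes)

# The trigger column is boolean again, so it can be used as a mask.
selected = fixed[fixed["Trigger"]]
```

This would only paper over the problem after the fact, and it relies on the float columns containing no NaN values (a NaN would become True when cast to bool), so I would still prefer the dtypes to come out right at read time.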

Is this whole mess a limitation of using dask or of uproot?

For reference, I am using uproot 3.11.3 and have dask version 2.6.0 installed.

Graipher
  • Since this is short, I'll make it a comment, rather than an answer: I think it's a DataFrame restriction. The same is true of Pandas: _sometimes_ it changes your types because it tries to put multiple columns into a single allocated block behind the scenes. I'm just guessing at this point, but maybe Dask DataFrame does the same sorts of things as Pandas DataFrame. – Jim Pivarski Apr 01 '20 at 17:11
  • @JimPivarski: Well, this also happens if I try to load only a single (non-float) branch. It still has only float columns afterwards. I could understand this behavior if there are mixed dtypes and it tries to find the least restrictive one, but in that case... – Graipher Apr 01 '20 at 17:55

0 Answers