I am trying to parallelize a (somewhat) simple script using dask. I originally read my ROOT file using uproot into a pandas.DataFrame in this way:
import uproot
file_name = "ntuple.root"
tree_name = "Events"
branches = ["Trigger", "nTracks", "mass"]
df = uproot.open(file_name)[tree_name].pandas.df(branches)
df.dtypes.to_dict()
# {'Trigger': dtype('bool'),
# 'nTracks': dtype('uint64'),
# 'mass': dtype('float64')}
When I do this, everything works fine: the types of the different branches are correctly recognized. In particular, in the example above, the "Trigger" branch is a boolean, the "nTracks" branch an integer, and the "mass" branch a float.
However, when I read the same file using uproot.daskframe, I instead get back a dask DataFrame in which all three branches are assigned float64, like this:
df = uproot.daskframe(file_name, tree_name, branches)
df.dtypes.to_dict()
# {'Trigger': dtype('float64'),
# 'nTracks': dtype('float64'),
# 'mass': dtype('float64')}
This makes the script break later, because I need the "Trigger" branch to be a plain boolean in order to cut on it.
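Concretely, the cut later in the script looks roughly like this (simplified from my actual code):

selected = df[df["Trigger"]]  # row selection; only makes sense if Trigger is a boolean column

With everything coerced to float64, this masking fails, which is why the dtype matters to me.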
Ideally I would want it to automatically recognize the right dtype for each column.
Alternatively, setting them manually would also be fine for me; I could find out the right dtypes by reading one event the old way and using those. According to the documentation I should be able to pass a dictionary that maps branch names to uproot.interp.interp.Interpretation objects, but it is unclear to me how exactly that works.
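My best guess from the docs is something like the sketch below, using uproot.asdtype to spell out each branch's interpretation. I have not been able to confirm that this is the intended usage, so the call pattern here is an assumption on my part:

import uproot

# Guessed usage (unverified): map each branch name to an explicit interpretation
interps = {
    "Trigger": uproot.asdtype("bool"),
    "nTracks": uproot.asdtype("uint64"),
    "mass": uproot.asdtype("float64"),
}
df = uproot.daskframe(file_name, tree_name, interps)

If something along these lines is the right approach, the correct spelling of this interpretation dict would already answer my question.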
Is this whole mess a limitation of using dask or of uproot?
For reference, I am using uproot 3.11.3 and have dask version 2.6.0 installed.