Assuming that you have read out all of the arrays as a dict named `arrays`:
```python
import uproot

file = uproot.open("/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/"
                   "miniTrees/user.cdelitzs.JZ2W.mini.root"
                   )["FlatSubstructureJetTreeD"]
arrays = file.arrays(["fjet_clus_P", "fjet_clus_px", "fjet_clus_py",
                      "fjet_clus_pz"], namedecode="utf-8")
```
You can `None`-pad each array with the `pad` method and then turn the `None` values into zeros with `fillna`. In `pad`, you have to specify a length; let's take the maximum count over all events. After this operation, the `JaggedArray`s happen to have equal lengths in the second dimension, so you can turn them into regular NumPy arrays with `regular`.
```python
for name in arrays:
    # pad each branch up to the longest event in that branch
    longest = arrays[name].counts.max()
    arrays[name] = arrays[name].pad(longest).fillna(0).regular()
```
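To see concretely what that chain does, here's a minimal sketch on a toy array (assuming awkward 0.x, which is where the `pad`/`fillna`/`regular` methods above come from):

```python
import awkward  # awkward 0.x

a = awkward.fromiter([[1.1, 2.2], [3.3]])   # jagged: counts are [2, 1]
print(a.pad(2).fillna(0).regular())
# [[1.1 2.2]
#  [3.3 0. ]]
```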
Now that they're (two-dimensional) NumPy arrays, h5py will recognize them and you can write them to the HDF5 file the normal way.
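For example, a minimal h5py sketch (the output file name and dataset layout here are hypothetical, one dataset per branch):

```python
import h5py

with h5py.File("clusters.h5", "w") as f:
    for name in arrays:
        f.create_dataset(name, data=arrays[name])
```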
Edit: And if you want all the data in a contiguous block array, you'll have to choose a single `longest` length, preallocate the block, and fill it. (The call to `regular` ought to be optional now, but check.)
```python
import numpy

longest = 2
for name in arrays:
    arrays[name] = arrays[name].pad(longest).fillna(0)

output = numpy.empty(file.numentries,
                     dtype=[("px1", "f8"), ("py1", "f8"), ("pz1", "f8"),
                            ("px2", "f8"), ("py2", "f8"), ("pz2", "f8")])
output["px1"] = arrays["fjet_clus_px"][:, 0]
output["py1"] = arrays["fjet_clus_py"][:, 0]
output["pz1"] = arrays["fjet_clus_pz"][:, 0]
output["px2"] = arrays["fjet_clus_px"][:, 1]
output["py2"] = arrays["fjet_clus_py"][:, 1]
output["pz2"] = arrays["fjet_clus_pz"][:, 1]
```
This is vectorized (i.e. no Python for loops, implicit or explicit, over the rows). Even if you write a loop over all column names, there are only 10-ish columns, but probably millions or billions of rows.
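If you'd rather not write each assignment by hand, a short loop over the component names does the same thing (the `fjet_clus_*` pattern is taken from the branch names above; everything else is as before):

```python
for i in range(longest):
    for comp in ("px", "py", "pz"):
        output[comp + str(i + 1)] = arrays["fjet_clus_" + comp][:, i]
```

The loop runs over the six column assignments, not over the rows, so it doesn't compromise the vectorization.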