
I have a .root file containing a tree named FlatSubstructureJetTreeD:

file = uproot.open("/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/user.cdelitzs.JZ2W.mini.root")["FlatSubstructureJetTreeD"]

It has the following branches:

file.keys()
['fjet_pt', 'fjet_clus_P', 'fjet_clus_px', 'fjet_clus_py', 'fjet_clus_pz', 'EventInfo_mcEventWeight', 'fjet_xsec_filteff_numevents']

fjet_clus_P, fjet_clus_px, fjet_clus_py, fjet_clus_pz are jagged arrays (a different number of entries in each event).

I need to make a zero-padded dataset, saved as an .h5 file, in which each row has entries in the format [fjet_clus_P1, fjet_clus_px1, fjet_clus_py1, fjet_clus_pz1, fjet_clus_P2, fjet_clus_px2, fjet_clus_py2, fjet_clus_pz2, ..., fjet_clus_Pn, fjet_clus_pxn, fjet_clus_pyn, fjet_clus_pzn]. Could you suggest the smartest and most memory-efficient way to do this in uproot?
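
For example (made-up numbers), an event with two constituents zero-padded to a maximum length of three would become a single row like

# illustrative only: 2 real constituents, zero-padded to length 3
# layout: [P1, px1, py1, pz1, P2, px2, py2, pz2, P3, px3, py3, pz3]
row = [45.2, 10.1, -3.4, 43.9, 12.7, 2.2, 5.0, 11.4, 0.0, 0.0, 0.0, 0.0]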

Thanks, Debo.

1 Answer

Assuming that you have read out all of the arrays as a dict named arrays,

import uproot
file = uproot.open("/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/"
            "miniTrees/user.cdelitzs.JZ2W.mini.root"
           )["FlatSubstructureJetTreeD"]
arrays = file.arrays(["fjet_clus_P", "fjet_clus_px", "fjet_clus_py",
                      "fjet_clus_pz"], namedecode="utf-8")

You can None-pad each array with the pad method and then turn the None values into zeros with fillna. In pad, you have to specify a length; let's take the maximum number of entries over all events (separately for each branch). After this operation, the JaggedArrays happen to have equal lengths in the second dimension, so turn them into NumPy arrays with regular.

for name in arrays:
    # largest number of constituents in any event, for this branch
    longest = arrays[name].counts.max()
    # pad to that length, replace the None padding with zeros, and make a regular 2D array
    arrays[name] = arrays[name].pad(longest).fillna(0).regular()

Now that they're (two-dimensional) NumPy arrays, h5py will recognize them and you can write them to the HDF5 file the normal way.
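
For example, a minimal sketch of that step with h5py (the output filename and gzip compression are placeholder choices, not part of the original answer):

import h5py

# write each zero-padded 2D array as its own dataset in one HDF5 file
with h5py.File("padded.h5", "w") as out:
    for name, array in arrays.items():
        out.create_dataset(name, data=array, compression="gzip")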

Edit: And if you want all the data in a contiguous block array, you'll have to choose a single longest length, preallocate the block, and fill it. (The call to regular ought to be optional now, but check.)

import numpy

longest = 2
for name in arrays:
    arrays[name] = arrays[name].pad(longest).fillna(0)

# interleaved record layout: one field per (variable, constituent) pair
output = numpy.empty(file.numentries,
                     dtype=[("px1", "f8"), ("py1", "f8"), ("pz1", "f8"),
                            ("px2", "f8"), ("py2", "f8"), ("pz2", "f8")])
output["px1"] = arrays["fjet_clus_px"][:, 0]
output["py1"] = arrays["fjet_clus_py"][:, 0]
output["pz1"] = arrays["fjet_clus_pz"][:, 0]
output["px2"] = arrays["fjet_clus_px"][:, 1]
output["py2"] = arrays["fjet_clus_py"][:, 1]
output["pz2"] = arrays["fjet_clus_pz"][:, 1]

This is vectorized (i.e. no Python for loops, explicit or implicit). Even if you write a loop over the column names, that loop only runs 10-ish times; the vectorized assignments handle the millions or billions of rows.

Jim Pivarski
  • Thanks, but I need a small modification: the way you suggested, each row would first hold all the Ps, then all the pxs, pys, and pzs respectively, but I want each row to be P1, px1, py1, pz1, P2, px2, py2, pz2, ..., Pn, pxn, pyn, pzn. If I wrote a loop it would be `for i in range(0, P.shape): for j in range(0, P[i].shape): write P[i,j], px[i,j], py[i,j], pz[i,j] into a 2D array`. Is there a way in uproot to do this without a for loop, i.e. in a vectorized way? Thanks, Debo. – Debottam Bakshi Gupta Nov 22 '19 at 23:21
  • Hi @jim, for some root files I am getting a memory error in `arrays[name] = arrays[name].pad(longest).fillna(0).regular()`: `File "/home/debo/env_autoencoder/local/lib/python2.7/site-packages/awkward/array/jagged.py", line 1876, in pad content = self._content[index] MemoryError`. Do you have any suggestion to make it memory-efficient, e.g. by doing it chunk by chunk? – Debottam Bakshi Gupta Nov 24 '19 at 21:25
  • Yes: instead of calling `file.arrays` and doing everything at once, call `file.iterate` in a loop. See the [uproot README](https://github.com/scikit-hep/uproot#readme) for examples. – Jim Pivarski Nov 25 '19 at 20:00
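
For reference, the chunked pattern described in that last comment might look like this with the uproot 3 / awkward 0 interface used in the answer (the chunk size, fixed padded length, and HDF5-writing step are placeholders, not from the original answer):

import uproot

tree = uproot.open("/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/"
                   "miniTrees/user.cdelitzs.JZ2W.mini.root")["FlatSubstructureJetTreeD"]

branches = ["fjet_clus_P", "fjet_clus_px", "fjet_clus_py", "fjet_clus_pz"]
longest = 200  # placeholder: one fixed padded length so every chunk has the same width

# read and process the tree in chunks instead of all at once
for chunk in tree.iterate(branches, entrysteps=100000, namedecode="utf-8"):
    for name in chunk:
        # pad and zero-fill only this chunk, keeping peak memory bounded
        chunk[name] = chunk[name].pad(longest).fillna(0)
    # ... write or append this chunk to the HDF5 file here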