
root2array returns an array of tuples (a structured array) with a dtype containing the branch names. Is there a way to get the same format from uproot.iterate() without costly reshaping afterwards?

Output should be the same as from

array = root2array(['file.root'], treename='tree', branches=['pt', 'eta'])

which produces something like np.array([(pt0, eta0), (pt1, eta1), ...], dtype=[('pt', '<f4'), ('eta', '<f4')])
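For concreteness, the repacking the question wants to avoid looks something like this (a minimal plain-numpy sketch with made-up values; branch names as in the question):

```python
import numpy as np

# Columns as uproot.iterate() hands them back (hypothetical values).
pt = np.array([10.5, 20.1, 30.7], dtype="<f4")
eta = np.array([-1.2, 0.3, 2.1], dtype="<f4")

# Repacking into a root2array-style structured array copies every column.
array = np.empty(len(pt), dtype=[("pt", "<f4"), ("eta", "<f4")])
array["pt"] = pt
array["eta"] = eta

print(array.dtype)  # [('pt', '<f4'), ('eta', '<f4')]
```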

Andrzej Novák

1 Answer


If you have an upper bound on how large the array can be (i.e. you're getting it from iterate, so you can pass entrysteps=10000 and know that it will never be larger than 10000), then you can preallocate your array and pass it to uproot and have uproot fill that instead of creating new arrays. In your case, you can make it a record array:

import numpy

buffer = numpy.empty(20000, dtype=[("pt", "f8"), ("eta", "f8")])
pt_buffer = buffer["pt"]
eta_buffer = buffer["eta"]

The pt_buffer and eta_buffer are views of the buffer, which happen to be interleaved, but they work just as well as arrays. (The reason that I have allocated 20000, rather than just 10000, will be explained below.)
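That view behavior can be checked with plain numpy alone (a standalone sketch with a smaller buffer; no uproot needed):

```python
import numpy as np

# A record array and views of its two columns, as above.
buffer = np.empty(5, dtype=[("pt", "f8"), ("eta", "f8")])
pt_buffer = buffer["pt"]
eta_buffer = buffer["eta"]

# Writing through the views fills the interleaved record array in place.
pt_buffer[:] = [1.0, 2.0, 3.0, 4.0, 5.0]
eta_buffer[:] = 0.5

# The columns share memory with buffer: they are views, not copies.
assert np.shares_memory(pt_buffer, buffer)
assert buffer["pt"][2] == 3.0 and buffer["eta"][2] == 0.5
```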

Now say that you're interested in two branches whose default interpretation is uproot.asdtype(">f8", "f8"). Request these arrays with interpretation uproot.asarray(">f8", pt_buffer) and uproot.asarray(">f8", eta_buffer). The first argument is the Numpy dtype that will be used to interpret the raw data from the ROOT file (big-endian, hence the ">") and the second argument is the array you're going to read the data into, in-place.

for arrays in tree.iterate({"pt": uproot.asarray(">f8", pt_buffer),
                            "eta": uproot.asarray(">f8", eta_buffer)},
                           outputtype=tuple, entrysteps=10000):
    # Byte offset between the filled column and buffer, in units of entries.
    start = (arrays[0].ctypes.data - buffer.ctypes.data) // buffer.itemsize
    stop = start + len(arrays[0])
    array_of_tuples = buffer[start:stop]
    print(array_of_tuples)

See the documentation on this rarely used and not widely advertised feature.

Even though iterate is filling and sending you arrays in a tuple called arrays (because of outputtype=tuple), they're column views of the buffer record array ("array of tuples"). By looking at the original buffer, we see the structure that you want.

However, uproot actually fills buffer with whole-basket contents, starting at the beginning of the first relevant basket and ending at the end of the last relevant basket to cover each subrange: [0, 10000), [10000, 20000), [20000, 30000), etc. Therefore the part of buffer that you want may start several entries in (start != 0) and will likely end before 20000 (stop - start != len(buffer)). Since arrays[0] is a view of the first column in buffer containing only the entries that you do want, the difference between arrays[0].ctypes.data and buffer.ctypes.data is the number of bytes into buffer that you want. Dividing by buffer.itemsize gives the number of entries. The ending position is easier to calculate.

The preallocation of buffer has to be big enough to include all the entries you do want and any additional entries that come along with a basket and need to be cut off. 20000 is safe if no basket is larger than 10000. For a given tree, you can determine the largest number of entries in any basket of any branch with:

max(branch.basket_numentries(i) for branch in tree.values()
                                for i in range(branch.numbaskets))

Clearly, that's not what these functions were designed for: asarray was meant for performance, to avoid reallocating big arrays like buffer. It was assumed, however, that you'd want the data in columns: the arrays[0] and arrays[1] sent to the body of the for loop. In the above, we additionally want to look at the data formatted as a record array ("array of tuples"), so we're actually looking at this "dumping ground" known as buffer. To do that sensibly, avoiding the entries not relevant to this subrange, we have to cut them out explicitly, and there weren't any functions in the library for figuring out where that subrange is. However, this

    start = (arrays[0].ctypes.data - buffer.ctypes.data) // buffer.itemsize
    stop = start + len(arrays[0])
    array_of_tuples = buffer[start:stop]

would be a general implementation of such a function.
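For instance (a sketch; the helper name subrange_of is made up here, and a plain numpy slice stands in for the column view that uproot would fill):

```python
import numpy as np

def subrange_of(buffer, column):
    """Slice of the record array `buffer` spanned by the column view `column`."""
    # Byte offset from the start of `buffer`, converted to a number of entries.
    start = (column.ctypes.data - buffer.ctypes.data) // buffer.itemsize
    stop = start + len(column)
    return buffer[start:stop]

# Demonstration: a column view that begins mid-buffer, as a basket's would.
buffer = np.empty(6, dtype=[("pt", "f8"), ("eta", "f8")])
buffer["pt"] = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
buffer["eta"] = 0.0
column = buffer["pt"][2:5]               # plays the role of arrays[0]
array_of_tuples = subrange_of(buffer, column)
print(array_of_tuples["pt"])  # [2. 3. 4.]
```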

Jim Pivarski
  • Warning: the above is untested—typed into a phone from memory. If you find any errors in it, let's fix it for posterity. Thanks! – Jim Pivarski Nov 02 '19 at 12:29
  • I am looking to replace a root_numpy piece of code to get rid of the ROOT dependency, but would like to avoid changing the rest of the rather large codebase. Performance is important due to the converted dataset size. – Andrzej Novák Nov 02 '19 at 14:38
  • The above approach still seems to return a tuple of arrays rather than an array of tuples like [(pt1, eta1), (pt2, eta2)...]. But also ```gen_u = uproot.iterate(['infile.root'], 'tree', {"fj_pt": uproot.asarray(">f8", pt_buffer), "fj_eta": uproot.asarray(">f8", eta_buffer)}, outputtype=tuple, entrysteps=10000)``` throws the following: `ValueError: cannot put 10359 items into an array of 10000 items` – Andrzej Novák Nov 02 '19 at 14:43
  • In the above method, the `buffer` is filled in place with the data, and by construction, the `buffer` is a record array (an "array of tuples"). I'll change the answer to access the original `buffer`, rather than the separate-column views of it. – Jim Pivarski Nov 03 '19 at 19:24
  • Perfect, this is exactly what I've been looking for except for an Assertion error while running `iterate` https://gist.github.com/andrzejnovak/f920a66f2f30c8502d199382556cc16b Setting different entry steps doesn't seem to make a difference, buffer larger than the file size also. Seems to happen half way through the number of events in the file. – Andrzej Novák Nov 05 '19 at 09:53
  • I need to change that error message from `assert remainder == 0` to something like "this Interpretation is not valid for this branch" (because the sizes don't line up). In my example above, I was assuming that they're 8-byte float (i.e. `double`) branches; if they're 4-byte floats (i.e. `float`), then you want `f4` and not `f8`. You're replacing `asdtype` with `asarray`, but keeping the dtypes the same. If this worked for one file and not another, it could have been that you need `f4` and your first file had an even number of events. Check the default `branch.interpretation`! – Jim Pivarski Nov 06 '19 at 10:11