If you have an upper bound on how large the array can be (i.e. you're getting it from `iterate`, so you can pass `entrysteps=10000` and know that it will never be larger than 10000), then you can preallocate your array, pass it to uproot, and have uproot fill that instead of creating new arrays. In your case, you can make it a record array:
```python
import numpy

# One contiguous record array; "pt" and "eta" are interleaved in memory.
buffer = numpy.empty(20000, dtype=[("pt", "f8"), ("eta", "f8")])
pt_buffer = buffer["pt"]
eta_buffer = buffer["eta"]
```
The `pt_buffer` and `eta_buffer` are views of `buffer`, which happen to be interleaved, but they work just as well as arrays. (The reason I have allocated 20000, rather than just 10000, will be explained below.)
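To see what "views" and "interleaved" mean concretely, here's a minimal Numpy-only sketch (the values are illustrative): a field of a record array is a strided view of the same memory, not a copy.

```python
import numpy

buffer = numpy.empty(4, dtype=[("pt", "f8"), ("eta", "f8")])
pt_buffer = buffer["pt"]

# Writing through the view writes into the record array itself:
pt_buffer[:] = [1.0, 2.0, 3.0, 4.0]
print(buffer["pt"])        # [1. 2. 3. 4.]

# The 16-byte stride is twice the float size: "pt" and "eta" values
# alternate in memory, which is why the views are "interleaved".
print(pt_buffer.strides)   # (16,)
```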
Now say that you're interested in two branches whose default interpretation is `uproot.asdtype(">f8", "f8")`. Request these arrays with the interpretations `uproot.asarray(">f8", pt_buffer)` and `uproot.asarray(">f8", eta_buffer)`. The first argument is the Numpy dtype that will be used to interpret the raw data from the ROOT file (big-endian, hence the `">"`), and the second argument is the array you're going to read the data into, in-place.
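In the loop below, `tree` is assumed to be a TTree you've already opened. As a minimal sketch with illustrative file and tree names, you can also confirm a branch's default interpretation with uproot 3's `uproot.interpret`:

```python
import uproot

tree = uproot.open("events.root")["Events"]  # file/tree names are illustrative
print(uproot.interpret(tree["pt"]))          # expect something like asdtype('>f8')
```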
```python
for arrays in tree.iterate({"pt": uproot.asarray(">f8", pt_buffer),
                            "eta": uproot.asarray(">f8", eta_buffer)},
                           outputtype=tuple, entrysteps=10000):
    start = int((arrays[0].ctypes.data - buffer.ctypes.data) / buffer.itemsize)
    stop = start + len(arrays[0])
    array_of_tuples = buffer[start:stop]
    print(array_of_tuples)
```
See the documentation on this rarely used and not widely advertised feature.
Even though `iterate` is filling and sending you arrays in a tuple called `arrays` (because of `outputtype=tuple`), they're column-views of the `buffer` record array ("array of tuples"). By looking at the original `buffer`, we see the structure that you want.
However, uproot actually fills `buffer` with whole-basket contents, starting at the beginning of the first relevant basket and ending at the end of the last relevant basket, to cover each subrange: [0, 10000), [10000, 20000), [20000, 30000), etc. Therefore the part of `buffer` that you want may start several entries in (`start != 0`) and will likely end before 20000 (`stop - start != len(buffer)`). Since `arrays[0]` is a view of the first column in `buffer` containing only the entries that you do want, the difference between `arrays[0].ctypes.data` and `buffer.ctypes.data` is the number of bytes into `buffer` at which the wanted entries begin. Dividing by `buffer.itemsize` converts that byte offset into a number of entries. The ending position is easier to calculate: it's just `start` plus the number of entries delivered.
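The same arithmetic can be checked with plain Numpy, independent of uproot; here's a minimal sketch with made-up basket offsets (1500 and 11500 are illustrative):

```python
import numpy

buffer = numpy.empty(20000, dtype=[("pt", "f8"), ("eta", "f8")])
pt_buffer = buffer["pt"]

# Pretend basket boundaries put the wanted entries at [1500, 11500):
view = pt_buffer[1500:11500]

# Byte offset of the view into the buffer, divided by the record size,
# recovers the starting entry; the view's length gives the stop.
start = (view.ctypes.data - buffer.ctypes.data) // buffer.itemsize
stop = start + len(view)
print(start, stop)         # 1500 11500
```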
The preallocated `buffer` has to be big enough to include all the entries you do want and any additional entries that come along with a basket and need to be cut off. 20000 is safe if no basket is larger than 10000 entries. For a given `tree`, you can determine the largest number of entries in any basket of any branch with:

```python
max(branch.basket_numentries(i) for branch in tree.values()
                                for i in range(branch.numbaskets))
```
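Putting that together, a hedged sizing rule following the 20000 = 10000 + 10000 reasoning above (`max_basket_entries` and `entrysteps` are illustrative names):

```python
import numpy

# Largest number of entries in any basket of any branch (from above):
max_basket_entries = max(branch.basket_numentries(i)
                         for branch in tree.values()
                         for i in range(branch.numbaskets))

# entrysteps entries you want, plus room for basket overhang to cut off:
entrysteps = 10000
buffer = numpy.empty(entrysteps + max_basket_entries,
                     dtype=[("pt", "f8"), ("eta", "f8")])
```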
Clearly, that's not what these functions were designed for: `asarray` was meant for performance, to avoid reallocating big arrays like `buffer`. It was assumed, however, that you'd want the data in columns: the `arrays[0]` and `arrays[1]` sent to the body of the for loop. In the above, we additionally want to look at the data formatted as a record array ("array of tuples"), so we're actually looking at this "dumping ground" known as `buffer`. To do that sensibly, avoiding the entries that are not relevant for this subrange, we have to explicitly cut them out, and there aren't any functions in the library for figuring out where that subrange is. However, this:

```python
start = int((arrays[0].ctypes.data - buffer.ctypes.data) / buffer.itemsize)
stop = start + len(arrays[0])
array_of_tuples = buffer[start:stop]
```

would be a general implementation of such a function.
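Wrapped up as a reusable helper, it might look like this (a sketch; `subrange_view` is a hypothetical name, not part of uproot):

```python
def subrange_view(buffer, column):
    """Slice of the preallocated record array `buffer` covering only the
    entries actually delivered in `column`, one of the filled column-views."""
    start = (column.ctypes.data - buffer.ctypes.data) // buffer.itemsize
    return buffer[start:start + len(column)]
```

Inside the loop above, `array_of_tuples = subrange_view(buffer, arrays[0])` would replace the three lines.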