
I'm new to uproot and I am trying to achieve a fairly simple task, but I'm not sure how to do this. Essentially, I have a ROOT file that contains a bunch of histograms and one TTree that is made up of 8 branches with roughly 4 million entries.

What I need to do is make a new ROOT file and copy 80% of the TTree from the original file into a TTree (called training) and the remaining 20% into a second TTree in the same new file (called test).

What I have tried is making a directory in Python into which I read all of the data from the original file, branch by branch. I then used this directory to write the data into the two new TTrees.

This is kind of working: I am getting a file with the structure that I wanted, but I'm not entirely satisfied, for two reasons:

  • Surely there has to be a more direct way? First reading the data into Python and then writing it into a file seems extremely cumbersome and memory-intensive.
  • I am honestly not very experienced with ROOT, but from the way I understand it, in my original file I have a tree that contains my 4 million events. Each event has a value for each branch, so when I say, 'get me entry 555!', I get 8 values (one for each branch). If I just copy the branches the way I am doing, do I lose this structure, or does the index of all the arrays in my directory replace the entry number, so that grabbing the values from all arrays at index 555 is the same as returning entry 555 before?

Any help would be welcome. Thanks!

EichFlo

1 Answer


This task will always involve reading the data into memory and writing it back out, whether those arrays are in the user's control or hidden from view.

There's one possible exception: if you want to read TBaskets from one file and write them to another without decompressing them, then they're still in memory but not decompressed, which can be a performance boost. ROOT can do this as a "fast copy," but Uproot doesn't have an equivalent. Such a copy would require that you don't modify the data in the TBaskets in any way, including slicing at arbitrary event boundaries, which can be an issue if the TBaskets for the 8 TBranches you're interested in don't line up at common event boundaries. (Such a feature could be added to Uproot; there's no technical limitation, but it is only useful in certain cases.)

So the process of reading arrays out of one file and writing them into another is about as good as it gets, with the above caveat.
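
For concreteness, here's a minimal sketch of that read-then-write approach in the uproot 3 API (the same API as the iteration example below); the file name, tree name, and output layout are placeholders to adapt to your own file:

import uproot  # uproot 3.x

# Placeholder names: replace "original.root", "mytree", and "split.root" with your own.
ttree = uproot.open("original.root")["mytree"]
arrays = ttree.arrays(namedecode="utf-8")        # dict: branch name -> NumPy array

numtraining = int(0.8*ttree.numentries)

outfile = uproot.recreate("split.root")
branchtypes = {name: array.dtype for name, array in arrays.items()}
outfile["training"] = uproot.newtree(branchtypes)
outfile["test"] = uproot.newtree(branchtypes)

# Slicing every array at the same entry boundary keeps the events aligned.
outfile["training"].extend({name: array[:numtraining] for name, array in arrays.items()})
outfile["test"].extend({name: array[numtraining:] for name, array in arrays.items()})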

I am unsure what you mean by a "directory in Python."

To answer your second question, the arrays that are read out of a TTree are aligned in the sense that entry 555 of one TBranch belongs to the same event as entry 555 of another TBranch. This is a common way of working with sets of arrays in NumPy, though it's an uncommon way of working with ROOT data; in ROOT, an event is an object, or at least you don't see more than one event at a time.
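
For illustration, assuming an arrays dict like the one in the sketch above (branch name → NumPy array), "entry 555" is nothing more than the same index taken from every array:

entry_555 = {name: array[555] for name, array in arrays.items()}
# entry_555 now holds 8 values, one per branch: the same information that
# ROOT would present as a single event object.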

If you have memory issues (probably not with 8 TBranches × 4 million non-jagged events, which comes to about 244 MB of RAM in double precision), then you can consider iterating:

# "ttree" is the input TTree; "training" and "test" are writable TTrees in the
# output file (e.g. training = outfile["training"] after defining them with
# uproot.newtree, as in the sketch above).
numtraining = int(0.8*ttree.numentries)
numtest = ttree.numentries - numtraining

# Iterate in chunks of at most 1 GB so the whole dataset never has to be in
# memory at once; each call to extend writes one TBasket per TBranch.
for chunk in ttree.iterate("*", entrysteps="1 GB", entrystop=numtraining):
    training.extend(chunk)

for chunk in ttree.iterate("*", entrysteps="1 GB", entrystart=numtraining):
    test.extend(chunk)

This gives you control over the size of your output TBaskets, since each TBranch gets one TBasket per call to extend. The above example ensures that the set of TBranches that have to be read together collectively occupies at most 1 GB at a time.
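
If you'd rather fix the chunk boundaries by entry count instead of memory footprint, entrysteps also accepts an integer number of entries (the 500000 below is just an illustrative choice):

for chunk in ttree.iterate("*", entrysteps=500000, entrystop=numtraining):
    training.extend(chunk)   # each TBranch gets a TBasket of 500000 entries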

Unlike a "fast copy" (see above), what you're doing is not just copying, it's also repartitioning the data, which can improve performance when you read those output files. Generally, larger chunks (larger TBaskets) are faster to read, but too large and they can require too much memory.

Jim Pivarski