Copy TTree to Other File

Question

I'm trying to extract cycles/revisions ("TreeName;3" etc.) from one ROOT file and make them their own trees in a new one. I tried doing it by creating a new file and assigning each tree to a new name in it, but I get an error telling me that TTree is not writable:

import uproot

with uproot.open("old_file.root") as in_file:
    with uproot.recreate("new_file.root") as out_file:
        for key in in_file.keys():
            ttree = in_file[key]
            new_name = key.replace(";","_")
            out_file[new_name] = ttree  # raises NotImplementedError

This resulted in NotImplementedError: this ROOT type is not writable: TTree

I'm kind of confused, because when I print out out_file it tells me that it is a <WritableDirectory '/' ...>. I expected it to assign ttree to out_file[new_name] by value. However, digging into the documentation, "uproot.writing.identify.add_to_directory" says it will raise this error if the object to be added is not writable, so I guess it doesn't just make a copy in memory like I expected it to.

Next I tried to make a new tree first and then move the data in chunk by chunk. However this also didn't work because the tree creation failed:

out_file[new_name] = ttree.typenames()

ValueError: 'extend' must fill every branch with the same number of entries; 'name2' has 7 entries

The typenames are something like {'name1': 'double', 'name2': 'int32_t', 'name3': 'double[]', 'name4': 'int32_t[]', 'name5': 'bool[]'}.

Trying to debug it, I noticed some very strange behavior:

out_file[new_name] = {'name1': 'double', 'name2': 'float32'}

yields the exact same error, while

out_file[new_name] = {'name1': 'float64', 'name2': 'float32'}
out_file[new_name].show()

gives

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
name1                | uint8_t                  | AsDtype('uint8')
name2                | uint8_t                  | AsDtype('uint8')

So at this point I don't know what a datatype is anymore.

Finally, I tried doing it by writing the arrays, but this failed too:

arrays = ttree.arrays(ttree.keys(),library='np')
out_file[key.replace(";","_")] = arrays

This gives TypeError: cannot write Awkward Array type to ROOT file: unknown

Similar issues arise when using Awkward Array or pandas.

There isn't a facility for copying whole TTrees from one file to another in Uproot, but perhaps there should be, since this question has been asked a few times. Since there isn't any "copy TTree" implementation, you have to read it into arrays (chunk by chunk, if necessary) and write it back, as you've been attempting to do. — Jim Pivarski, Nov 18 '22 at 15:18
The `typename` is a C++ type; the types that TTree initialization ([mktree](https://uproot.readthedocs.io/en/latest/uproot.writing.writable.WritableDirectory.html#mktree)) takes are NumPy or Awkward types. (It hadn't occurred to me that someone would try using a C++ `typename` there, but this is a good consideration.) So `np.float64` is legal, `"float64"` is legal, `"var * float64"` (for a ragged array) is legal, but `"double"` and `"double[]"` are not. — Jim Pivarski, Nov 18 '22 at 15:22
The `out_file[new_name] = {"name1": array1, "name2": array2}` syntax takes _arrays_ as the values of the dict, not type names. See [WritableDirectory.mktree](https://uproot.readthedocs.io/en/latest/uproot.writing.writable.WritableDirectory.html#mktree) if you want to allocate a TTree before filling it with [WritableTree.extend](https://uproot.readthedocs.io/en/latest/uproot.writing.writable.WritableTree.html#extend). In your case, `'float64'` is interpreted as the array itself, which is 7 `uint8` values (the characters in the string). That was also unanticipated and ought to be prevented. — Jim Pivarski, Nov 18 '22 at 15:25
Thank you so much for those comments @JimPivarski! Now it makes a lot more sense to me what happened :) — Chalky, Nov 18 '22 at 16:55
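
Both surprises in the question seem to follow from the comment above: if a string value is treated as an array of its characters, only the string lengths matter. A minimal sketch in plain Python (Uproot is not involved; `entries_per_branch` is a hypothetical helper that just counts bytes, on that assumption):

```python
# Uproot is not used here. This only counts characters, assuming (per the
# comment above) that each string value is treated as an array of its bytes.

def entries_per_branch(spec):
    """Hypothetical helper: number of 'entries' each string value would supply."""
    return {name: len(value.encode("ascii")) for name, value in spec.items()}


failing = entries_per_branch({"name1": "double", "name2": "float32"})
working = entries_per_branch({"name1": "float64", "name2": "float32"})

print(failing)  # unequal lengths: consistent with the "'name2' has 7 entries" error
print(working)  # equal lengths: consistent with two uint8 branches being written
```

Under that reading, 'double' (6 characters) next to 'float32' (7 characters) fails the same-number-of-entries check, while 'float64' and 'float32' happen to have the same length and are written as seven uint8 values each.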

score 0 · Accepted Answer · answered Nov 18 '22 at 16:31

I decided to give a complete working example (following up on comments, above), but found that there are a lot of choices to be made. All you want to do is to copy the input TTree—you don't want to make choices—so you really want a high-level "copy whole TTree" function, but such a function does not exist. (That would be a good addition to Uproot or a new module that uses Uproot to do hadd-type work. A good project if anyone is interested!)

I'm starting with this file, which may be obtained in a variety of ways:

file_path = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"

file_path = "http://opendata.cern.ch/record/12341/files/Run2012BC_DoubleMuParked_Muons.root"

file_path = "/tmp/Run2012BC_DoubleMuParked_Muons.root"

It's big enough that it should be copied in chunks, not all at once. The first chunk sets the types, so it can be performed with an assignment of new branch names to arrays, but subsequent chunks have to call WritableTree.extend because you don't want to replace the new TTree, you want to add to it. Neither of these explicitly deals with types; the types are picked up from the arrays.

Here's a first attempt, using "100 MB" as a chunk size. (This will be the sum of TBasket sizes across TBranches in the output TTree. What we're doing here is more than copying; it's repartitioning the data into a new chunk size.)

with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True

    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]

        for arrays_chunk in input_ttree.iterate(step_size="100 MB"):
            if first_chunk:
                output_file["Events"] = arrays_chunk
                first_chunk = False
            else:
                output_file["Events"].extend(arrays_chunk)

However, it fails because assignment and extend expect a dict of arrays, not a single array.

So we could ask TTree.iterate to give us a dict of Awkward Arrays, one for each TBranch, rather than a single Awkward Array that represents all of the TBranches. That would look like this:

with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True

    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]

        for dict_of_arrays in input_ttree.iterate(step_size="100 MB", how=dict):
            if first_chunk:
                output_file["Events"] = dict_of_arrays
                first_chunk = False
            else:
                output_file["Events"].extend(dict_of_arrays)

It copies the file, but whereas the original file had TBranches like

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon                | uint32_t                 | AsDtype('>u4')
Muon_pt              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_eta             | float[]                  | AsJagged(AsDtype('>f4'))
Muon_phi             | float[]                  | AsJagged(AsDtype('>f4'))
Muon_mass            | float[]                  | AsJagged(AsDtype('>f4'))
Muon_charge          | int32_t[]                | AsJagged(AsDtype('>i4'))

the new file has TBranches like

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon                | uint32_t                 | AsDtype('>u4')
nMuon_pt             | int32_t                  | AsDtype('>i4')
Muon_pt              | float[]                  | AsJagged(AsDtype('>f4'))
nMuon_eta            | int32_t                  | AsDtype('>i4')
Muon_eta             | float[]                  | AsJagged(AsDtype('>f4'))
nMuon_phi            | int32_t                  | AsDtype('>i4')
Muon_phi             | float[]                  | AsJagged(AsDtype('>f4'))
nMuon_mass           | int32_t                  | AsDtype('>i4')
Muon_mass            | float[]                  | AsJagged(AsDtype('>f4'))
nMuon_charge         | int32_t                  | AsDtype('>i4')
Muon_charge          | int32_t[]                | AsJagged(AsDtype('>i4'))

What happened is that Uproot didn't know that each of the Awkward Arrays has the same number of items per entry (that the number of pt values in one event is the same as the number of eta values in one event). If the TBranches hadn't all been muons, but some were muons and some were electrons or jets, that wouldn't be true, so Uproot wrote a separate counter TBranch for each one.

The reason these nMuon_pt, nMuon_eta, etc. TBranches are included at all is because ROOT needs them. The Muon_pt, Muon_eta, etc. TBranches are read, in ROOT, as C++ arrays of variable length, and a C++ user needs to know how big to preallocate an array and after which array entry the contents are uninitialized junk. These are not needed in Python (Awkward Array prevents users from seeing uninitialized junk).

So you could ignore them. But if you really need to/want to get rid of them, here's a way: build exactly the array you want to write. Now that we're dealing with types, we'll use WritableDirectory.mktree and specify types explicitly. Since every write is an extend, we won't have to keep track of whether we're writing the first_chunk or a subsequent chunk anymore.

For the Muon_pt, Muon_eta, etc. TBranches to share a counter TBranch, nMuon, you want a Muon field to be an array of variable-length lists of muon objects with pt, eta, etc. fields. That type can be constructed from a string:

import awkward as ak

muons_type = ak.types.from_datashape("""var * {
    pt: float32,
    eta: float32,
    phi: float32,
    mass: float32,
    charge: int32
}""", highlevel=False)

Given a chunk of separated arrays with type var * float32, you can make a single array with type var * {pt: float32, eta: float32, ...} with ak.zip.

muons = ak.zip({
    "pt": chunk["Muon_pt"],
    "eta": chunk["Muon_eta"],
    "phi": chunk["Muon_phi"],
    "mass": chunk["Muon_mass"],
    "charge": chunk["Muon_charge"],
})

(Printing muons.type gives you the type string back.) This is the form you're likely to be using for a data analysis. The assumption was that users would be analyzing data as objects between a read and a write, not reading from one file and writing to another without any modifications.

Here's a reader-writer, using muons_type:

with uproot.recreate("/tmp/output.root") as output_file:
    output_ttree = output_file.mktree("Events", {"Muon": muons_type})

    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]

        for chunk in input_ttree.iterate(step_size="100 MB"):
            muons = ak.zip({
                "pt": chunk["Muon_pt"],
                "eta": chunk["Muon_eta"],
                "phi": chunk["Muon_phi"],
                "mass": chunk["Muon_mass"],
                "charge": chunk["Muon_charge"],
            })

            output_ttree.extend({"Muon": muons})

Or you could have done it without explicitly constructing the muons_type by keeping track of the first_chunk again:

with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True

    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]

        for chunk in input_ttree.iterate(step_size="100 MB"):
            muons = ak.zip({
                "pt": chunk["Muon_pt"],
                "eta": chunk["Muon_eta"],
                "phi": chunk["Muon_phi"],
                "mass": chunk["Muon_mass"],
                "charge": chunk["Muon_charge"],
            })

            if first_chunk:
                output_file["Events"] = {"Muon": muons}
                first_chunk = False
            else:
                output_file["Events"].extend({"Muon": muons})

It is admittedly complex (because I'm showing many alternatives, with different pros and cons), but that's because copying TTrees without modification wasn't a foreseen use-case for the TTree-writing functions. Since it is an important use-case, a specialized function that hides these details would be a welcome addition.

Another source of complexity is that ROOT TTrees don't map onto Awkward Array types perfectly (hence, the counter TBranch and all that). The Arrow and Parquet formats are a better match; see [ak.to_parquet](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_parquet.html) and [ak.from_parquet](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_parquet.html). — Jim Pivarski, Nov 18 '22 at 16:34
It took me a bit to process and try out all of these options, but thank you for going into so much detail and explaining everything nicely :) An additional complication for me was that I won't know which branches are in the trees or what data types they are. So for the muons_type option, for example, I needed to convert the type strings from C++ to NumPy, which was a bit awkward. — Chalky, Nov 19 '22 at 17:24
The `Type` objects can be manipulated as an immutable tree—you can pull contents out and construct new ones by passing contents into their constructors—so it doesn't need to be manipulated via strings. (Object construction is usually more robust than string manipulation; at least, the errors are easier to understand.) Just FYI, in case it helps. — Jim Pivarski, Nov 20 '22 at 19:21
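
For the string route described in the comments (the `Type`-object route is likely more robust), a lookup table may be enough when the set of C++ typenames is small. The sketch below is an assumption-laden illustration: `CPP_TO_NUMPY` and `typename_to_type_string` are hypothetical names, and the table covers only the typenames that appear in this thread.

```python
# Assumed lookup table covering only the C++ typenames seen in this thread;
# anything else raises instead of guessing.
CPP_TO_NUMPY = {
    "double": "float64",
    "float": "float32",
    "int32_t": "int32",
    "uint32_t": "uint32",
    "bool": "bool",
}


def typename_to_type_string(typename):
    """Hypothetical helper: 'double[]' -> 'var * float64', 'int32_t' -> 'int32'."""
    is_ragged = typename.endswith("[]")
    base = typename[:-2] if is_ragged else typename
    if base not in CPP_TO_NUMPY:
        raise ValueError(f"no known NumPy equivalent for C++ typename {typename!r}")
    numpy_name = CPP_TO_NUMPY[base]
    return f"var * {numpy_name}" if is_ragged else numpy_name


# The typenames quoted in the question.
typenames = {
    "name1": "double",
    "name2": "int32_t",
    "name3": "double[]",
    "name4": "int32_t[]",
    "name5": "bool[]",
}
branch_types = {name: typename_to_type_string(t) for name, t in typenames.items()}
print(branch_types)
```

The resulting dict has the shape of a branch-type specification for mktree, as described in the comments on the question, though whether every string here is accepted has not been checked.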
