I decided to give a complete working example (following up on comments, above), but found that there are a lot of choices to be made. All you want to do is copy the input TTree, without making any choices, so you really want a high-level "copy whole TTree" function, but such a function does not exist. (That would be a good addition to Uproot, or to a new module that uses Uproot to do hadd-type work. A good project if anyone is interested!)
I'm starting with this file, which may be obtained in a variety of ways:
file_path = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
file_path = "http://opendata.cern.ch/record/12341/files/Run2012BC_DoubleMuParked_Muons.root"
file_path = "/tmp/Run2012BC_DoubleMuParked_Muons.root"
It's big enough that it should be copied in chunks, not all at once. The first chunk sets the types, so it can be performed with an assignment of new branch names to arrays, but subsequent chunks have to call WritableTree.extend because you don't want to replace the new TTree, you want to add to it. Neither of these explicitly deals with types; the types are picked up from the arrays.
Here's a first attempt, using "100 MB" as a chunk size. (This will be the sum of TBasket sizes across TBranches in the output TTree. What we're doing here is more than copying; it's repartitioning the data into a new chunk size.)
import uproot

with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True
    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]
        for arrays_chunk in input_ttree.iterate(step_size="100 MB"):
            if first_chunk:
                # Assignment creates the TTree; types come from the arrays.
                output_file["Events"] = arrays_chunk
                first_chunk = False
            else:
                # extend appends to the existing TTree instead of replacing it.
                output_file["Events"].extend(arrays_chunk)
However, it fails because assignment and extend expect a dict of arrays, not a single array.
So we could ask TTree.iterate to give us a dict of Awkward Arrays, one for each TBranch, rather than a single Awkward Array that represents all of the TBranches. That would look like this:
with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True
    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]
        # how=dict yields one array per TBranch instead of one record array.
        for dict_of_arrays in input_ttree.iterate(step_size="100 MB", how=dict):
            if first_chunk:
                output_file["Events"] = dict_of_arrays
                first_chunk = False
            else:
                output_file["Events"].extend(dict_of_arrays)
It copies the file, but whereas the original file had TBranches like
name | typename | interpretation
---------------------+--------------------------+-------------------------------
nMuon | uint32_t | AsDtype('>u4')
Muon_pt | float[] | AsJagged(AsDtype('>f4'))
Muon_eta | float[] | AsJagged(AsDtype('>f4'))
Muon_phi | float[] | AsJagged(AsDtype('>f4'))
Muon_mass | float[] | AsJagged(AsDtype('>f4'))
Muon_charge | int32_t[] | AsJagged(AsDtype('>i4'))
the new file has TBranches like
name | typename | interpretation
---------------------+--------------------------+-------------------------------
nMuon | uint32_t | AsDtype('>u4')
nMuon_pt | int32_t | AsDtype('>i4')
Muon_pt | float[] | AsJagged(AsDtype('>f4'))
nMuon_eta | int32_t | AsDtype('>i4')
Muon_eta | float[] | AsJagged(AsDtype('>f4'))
nMuon_phi | int32_t | AsDtype('>i4')
Muon_phi | float[] | AsJagged(AsDtype('>f4'))
nMuon_mass | int32_t | AsDtype('>i4')
Muon_mass | float[] | AsJagged(AsDtype('>f4'))
nMuon_charge | int32_t | AsDtype('>i4')
Muon_charge | int32_t[] | AsJagged(AsDtype('>i4'))
What happened is that Uproot didn't know that each of the Awkward Arrays has the same number of items per entry (that the number of pt values in one event is the same as the number of eta values in one event). If the TBranches hadn't all been muons, but some were muons and some were electrons or jets, that wouldn't be true.
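To make the counter idea concrete, here is a small plain-Python sketch. Nested lists stand in for the jagged arrays, and the values and the electron_pt branch are made up for illustration (they are not taken from the file). It computes the per-event counts that a counter TBranch would hold and checks whether two branches could share one counter.

```python
# Made-up jagged data: one inner list per event.
muon_pt = [[10.5, 20.1], [], [33.0]]
muon_eta = [[0.1, -1.2], [], [2.0]]
electron_pt = [[5.0], [7.5, 8.5], []]  # hypothetical branch from another collection


def counts(jagged):
    """Number of items in each event: what a counter TBranch stores."""
    return [len(event) for event in jagged]


def can_share_counter(*branches):
    """True if every branch has the same number of items in every event."""
    first = counts(branches[0])
    return all(counts(branch) == first for branch in branches[1:])


print(counts(muon_pt))                          # [2, 0, 1]
print(can_share_counter(muon_pt, muon_eta))     # True
print(can_share_counter(muon_pt, electron_pt))  # False
```

Without being told which branches belong together, a writer has no cheap way to know that the second case never occurs, so giving each jagged branch its own counter is the safe default.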
The reason these nMuon_pt, nMuon_eta, etc. TBranches are included at all is because ROOT needs them. The Muon_pt, Muon_eta, etc. TBranches are read, in ROOT, as C++ arrays of variable length, and a C++ user needs to know how big to preallocate an array and after which array entry the contents are uninitialized junk. These are not needed in Python (Awkward Array prevents users from seeing uninitialized junk).
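If you only want to hide the extra counters when reading the copy back, one option is to filter them out by name. The sketch below is plain Python over a list of branch names. The rule it applies (drop any name that is "n" followed by the name of another branch) is an assumption based on the listing above, not something Uproot provides, and drop_counter_branches is a hypothetical helper name.

```python
def drop_counter_branches(names):
    """Remove names of the form "n" + <another branch name>.

    Assumption: counters are named by prefixing "n" to the counted branch,
    as in the listing above. A branch like nMuon survives because there is
    no branch called Muon in this file.
    """
    present = set(names)
    return [
        name
        for name in names
        if not (name.startswith("n") and name[1:] in present)
    ]


names = ["nMuon", "nMuon_pt", "Muon_pt", "nMuon_eta", "Muon_eta"]
print(drop_counter_branches(names))  # ['nMuon', 'Muon_pt', 'Muon_eta']
```

The filtered list can then be used to choose which branches to read.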
So you could ignore them. But if you really need or want to get rid of them, here's a way: build exactly the array you want to write. Now that we're dealing with types, we'll use WritableDirectory.mktree and specify types explicitly. Since every write is an extend, we won't have to keep track of whether we're writing the first_chunk or a subsequent chunk anymore.
For the Muon_pt, Muon_eta, etc. TBranches to share a counter TBranch, nMuon, you want a Muon field to be an array of variable-length lists of muon objects with pt, eta, etc. fields. That type can be constructed from a string:
import awkward as ak
muons_type = ak.types.from_datashape("""var * {
pt: float32,
eta: float32,
phi: float32,
mass: float32,
charge: int32
}""", highlevel=False)
Given a chunk of separated arrays with type var * float32, you can make a single array with type var * {pt: float32, eta: float32, ...} with ak.zip.
muons = ak.zip({
    "pt": chunk["Muon_pt"],
    "eta": chunk["Muon_eta"],
    "phi": chunk["Muon_phi"],
    "mass": chunk["Muon_mass"],
    "charge": chunk["Muon_charge"],
})
(Printing muons.type gives you the type string back.) This is the form you're likely to be using for a data analysis. The assumption was that users would be analyzing data as objects between a read and a write, not reading from one file and writing to another without any modifications.
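As a mental model for what the zip does, here is a plain-Python sketch that uses nested lists instead of Awkward Arrays. It illustrates the concept only and is not how ak.zip is implemented; zip_jagged is a hypothetical helper name and the values are made up.

```python
def zip_jagged(columns):
    """Combine {field: jagged list} into one jagged list of records."""
    fields = list(columns)
    n_events = len(columns[fields[0]])
    if any(len(columns[field]) != n_events for field in fields):
        raise ValueError("fields disagree on the number of events")

    events = []
    for i in range(n_events):
        # Every field must have the same number of items in this event.
        lengths = {len(columns[field][i]) for field in fields}
        if len(lengths) != 1:
            raise ValueError(f"fields disagree on the number of items in event {i}")
        n_items = lengths.pop()
        events.append([
            {field: columns[field][i][j] for field in fields}
            for j in range(n_items)
        ])
    return events


muons = zip_jagged({
    "pt": [[10.5, 20.1], [], [33.0]],
    "charge": [[1, -1], [], [1]],
})
print(muons[0])                         # [{'pt': 10.5, 'charge': 1}, {'pt': 20.1, 'charge': -1}]
print([len(event) for event in muons])  # [2, 0, 1]
```

Because all fields of a record list share one length per event, a single counter per event is enough, which is why the zipped form needs only one counter TBranch.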
Here's a reader-writer, using muons_type:
with uproot.recreate("/tmp/output.root") as output_file:
    # Declare the TTree and its types up front, so every write is an extend.
    output_ttree = output_file.mktree("Events", {"Muon": muons_type})
    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]
        for chunk in input_ttree.iterate(step_size="100 MB"):
            muons = ak.zip({
                "pt": chunk["Muon_pt"],
                "eta": chunk["Muon_eta"],
                "phi": chunk["Muon_phi"],
                "mass": chunk["Muon_mass"],
                "charge": chunk["Muon_charge"],
            })
            output_ttree.extend({"Muon": muons})
Or you could have done it without explicitly constructing the muons_type by keeping track of the first_chunk again:
with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True
    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]
        for chunk in input_ttree.iterate(step_size="100 MB"):
            muons = ak.zip({
                "pt": chunk["Muon_pt"],
                "eta": chunk["Muon_eta"],
                "phi": chunk["Muon_phi"],
                "mass": chunk["Muon_mass"],
                "charge": chunk["Muon_charge"],
            })
            if first_chunk:
                output_file["Events"] = {"Muon": muons}
                first_chunk = False
            else:
                output_file["Events"].extend({"Muon": muons})
It is admittedly complex (because I'm showing many alternatives, with different pros and cons), but that's because copying TTrees without modification wasn't a foreseen use-case for the TTree-writing functions. Since it is an important use-case, a specialized function that hides these details would be a welcome addition.
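As a rough idea of what such a function might look like, here is a minimal sketch of the assign-then-extend pattern factored into a helper. The name copy_in_chunks and its signature are hypothetical, not part of Uproot. It only assumes that output supports output[name] = chunk to create a tree and output[name].extend(chunk) to append to it, as the output file does in the examples above.

```python
def copy_in_chunks(chunks, output, name):
    """Write an iterable of dict-of-arrays chunks to output[name].

    Hypothetical helper: the first chunk creates the tree (and fixes the
    types), every later chunk is appended. Returns the number of chunks
    written; nothing is created if chunks is empty.
    """
    n_written = 0
    for chunk in chunks:
        if n_written == 0:
            output[name] = chunk        # creates the tree from the first chunk
        else:
            output[name].extend(chunk)  # appends to the existing tree
        n_written += 1
    return n_written
```

Inside the two with blocks of the second example, this could be called as copy_in_chunks(input_ttree.iterate(step_size="100 MB", how=dict), output_file, "Events"). It inherits that example's behavior, including the extra counter TBranches.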