I just started looking into parallelizing the analysis of ROOT files, i.e. trees. I have worked with RDataFrame, where implicit multithreading can be enabled with one line of code (EnableImplicitMT()). This works quite well.
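For reference, a minimal sketch of that implicit approach, assuming a tree named "Events" with a branch "pt" in a file "data.root" (all names are made up):

```python
import ROOT

# One line enables implicit multithreading for all RDataFrame actions.
ROOT.EnableImplicitMT()

df = ROOT.RDataFrame("Events", "data.root")
# Filter and fill a histogram; the event loop runs multithreaded.
hist = df.Filter("pt > 20").Histo1D("pt")
print(hist.GetEntries())  # triggers the (parallel) event loop
```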
Now I want to experiment with explicit multiprocessing and uproot to see whether efficiency can be boosted even further, and I could use some guidance on a sensible approach.
Say I have a really large dataset (cannot be read in all at once) stored in a ROOT file with a couple of branches. Nothing too crazy has to be done for the analysis: some calculations, filtering, and then maybe filling some histograms.
The ideas I have:
1. Trivial parallelization: split the ROOT file into many smaller files, run the same analysis on all of them in parallel, and recombine the respective results at the end (first sketch below).
2. Read and analyze the file in batches, as described in the uproot docs, but distribute the batches and the operations on them to different cores, e.g. with the Python multiprocessing package (second sketch below).
3. Similar to 2., read the file in batches, but rather than distributing whole batches to the cores, slice up the arrays of one batch and distribute the slices and the operation on them (third sketch below).
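A minimal sketch of idea 1, assuming the big file has already been split into part_*.root, each containing a tree "Events" with a branch "pt" (all names are made up); since the partial histograms share one binning, recombining is just a sum:

```python
import glob
from multiprocessing import Pool

import numpy as np
import uproot

BINS = np.linspace(0.0, 200.0, 101)  # shared binning, so partial histograms add up

def analyze_file(path):
    """Run the full analysis on one small file, return its histogram counts."""
    with uproot.open(path) as f:
        pt = f["Events"]["pt"].array(library="np")
    pt = pt[pt > 20.0]                       # filtering step
    counts, _ = np.histogram(pt, bins=BINS)  # fill the histogram
    return counts

if __name__ == "__main__":
    files = sorted(glob.glob("part_*.root"))
    with Pool() as pool:
        partial = pool.map(analyze_file, files)
    total = np.sum(partial, axis=0)  # recombine the per-file results
```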
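A rough sketch of idea 2 (same made-up names): the main process pulls batches with uproot.iterate and ships them to a worker pool:

```python
from multiprocessing import Pool

import numpy as np
import uproot

BINS = np.linspace(0.0, 200.0, 101)

def analyze_batch(batch):
    """Process one batch of arrays: filter, then histogram."""
    pt = batch["pt"]
    counts, _ = np.histogram(pt[pt > 20.0], bins=BINS)
    return counts

if __name__ == "__main__":
    batches = uproot.iterate("data.root:Events", ["pt"],
                             step_size="100 MB", library="np")
    with Pool() as pool:
        # Each batch is pickled and sent to a worker as it is read.
        total = sum(pool.imap_unordered(analyze_batch, batches))
```

One caveat with this layout: all reading and decompression still happens serially in the main process, and each batch has to be pickled to reach a worker. A variant worth considering is to send each worker only an entry range and let it open the file and read its own chunk (uproot's array-reading functions accept entry_start/entry_stop), which avoids shipping the arrays around.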
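And a rough sketch of idea 3: each batch is read in the main process, split into as many slices as there are workers, and the slices are processed in parallel before moving on to the next batch:

```python
from multiprocessing import Pool

import numpy as np
import uproot

BINS = np.linspace(0.0, 200.0, 101)
N_WORKERS = 4

def analyze_slice(pt):
    """Process one slice of a batch."""
    counts, _ = np.histogram(pt[pt > 20.0], bins=BINS)
    return counts

if __name__ == "__main__":
    total = np.zeros(len(BINS) - 1)
    with Pool(N_WORKERS) as pool:
        for batch in uproot.iterate("data.root:Events", ["pt"],
                                    step_size="100 MB", library="np"):
            slices = np.array_split(batch["pt"], N_WORKERS)
            total += np.sum(pool.map(analyze_slice, slices), axis=0)
```

For operations as cheap as these, the pickling overhead per slice could easily eat the speedup, so I am not sure this buys anything over 2.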
I would appreciate some feedback on whether these approaches are worth trying, or whether there are better ways of handling large files efficiently.