
I just started looking into parallelizing the analysis of ROOT files, i.e. trees. I have worked with RDataFrames, where implicit multithreading can be enabled with one line of code (EnableImplicitMT()). This works quite well. Now I want to experiment with explicit multiprocessing and uproot to see whether efficiency can be boosted even further. I just need some guidance on a sensible approach.
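
For reference, this is roughly the pattern I mean (the tree name "Events", the file name, and the branch names are placeholders for my actual analysis):

```python
import ROOT

ROOT.EnableImplicitMT()  # one line: let RDataFrame use all available cores

df = ROOT.RDataFrame("Events", "data.root")
hist = (df.Filter("pt > 30")           # event selection
          .Define("pt2", "pt * pt")    # a derived quantity
          .Histo1D("pt2"))             # lazily booked histogram
hist.Draw()                            # triggers the (multithreaded) event loop
```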

Say I have a really large dataset (it cannot be read in all at once) stored in a ROOT file with a couple of branches. Nothing too crazy has to be done for the analysis: some calculations, filtering, and then maybe filling some histograms.

The ideas I have:

  1. Trivial parallelization: somehow split the ROOT file into many smaller files, run the same analysis in parallel on all of them, and recombine the respective results at the end.

  2. Read in the file and analyze it in batches, as described in the uproot docs, but distribute the batches and the operations on them to different cores, e.g. with the Python multiprocessing package (see the sketch after this list).

  3. Similar to 2., read in the file in batches, but rather than distributing whole batches to the cores, slice up the arrays of one batch and distribute the slices and the operations on them to the cores.
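
To make idea 2 a bit more concrete, this is roughly what I have in mind (file, tree, and branch names are placeholders, and I am not sure it is an efficient way to do it):

```python
import multiprocessing

import numpy as np
import uproot


def process_batch(batch):
    """Per-batch analysis: a simple cut followed by a histogram."""
    pt = batch["pt"]
    counts, _ = np.histogram(pt[pt > 30], bins=50, range=(0, 300))
    return counts


if __name__ == "__main__":
    # Read the tree in memory-bounded batches in the main process...
    batches = uproot.iterate(
        "data.root:Events",   # file:tree
        ["pt"],               # branches to read
        step_size="100 MB",   # batch size
        library="np",         # plain NumPy arrays, cheap to send to workers
    )
    # ...and distribute the batches to a pool of worker processes.
    with multiprocessing.Pool() as pool:
        partial_hists = pool.map(process_batch, batches)
    total_hist = np.sum(partial_hists, axis=0)  # recombine the results
```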

I would appreciate some feedback on whether these approaches are worth trying or whether there are better ways of handling large files efficiently.

Jailbone

1 Answer


A key thing to keep in mind about Uproot is that it isn't a framework for doing HEP analysis; it only reads ROOT files. The HEP analysis is the next step: code beyond any interactions with Uproot.

For the record, Uproot's file-reading can be parallelized, but that just means that multiple threads can be employed to wait for disk/network and to decompress data; the effect is the same: you wait for all the threads to be done to get the chunk of data, maybe a little faster. That's not what you're asking about.
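
For completeness, that kind of thread-parallel reading looks something like this (file, tree, and branch names are placeholders); the threads overlap I/O and decompression, but you still get back a single chunk of data at the end:

```python
from concurrent.futures import ThreadPoolExecutor

import uproot

executor = ThreadPoolExecutor(max_workers=8)

with uproot.open("data.root") as file:
    arrays = file["Events"].arrays(
        ["pt", "eta"],
        decompression_executor=executor,   # decompress baskets in parallel
        interpretation_executor=executor,  # deserialize them in parallel
    )
```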

You want your analysis code to be running in parallel, and that's a generic question about parallel processing in Python, not Uproot. You can break your work up into pieces (explicitly), and have each of those pieces independently use Uproot to read the data. Or you can use a Python library for parallel-processing to do it implicitly, such as Dask, or you can use a HEP-specific library that pulls these parts together, such as Coffea.
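
As a minimal sketch of the explicit route (file, tree, and branch names, as well as the analysis itself, are placeholders): split the tree into entry ranges and let each worker process open the file and read only its own range with Uproot, then combine the partial results.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import uproot

FILENAME = "data.root"
TREENAME = "Events"


def analyze_range(entry_start, entry_stop):
    """Each worker independently reads its own slice of the tree."""
    with uproot.open(FILENAME) as file:
        pt = file[TREENAME]["pt"].array(
            entry_start=entry_start, entry_stop=entry_stop, library="np"
        )
    counts, _ = np.histogram(pt[pt > 30], bins=50, range=(0, 300))
    return counts


if __name__ == "__main__":
    with uproot.open(FILENAME) as file:
        num_entries = file[TREENAME].num_entries

    n_workers = 8
    edges = np.linspace(0, num_entries, n_workers + 1, dtype=int)
    starts, stops = edges[:-1], edges[1:]

    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(analyze_range, starts, stops))

    total = np.sum(partial, axis=0)  # combine the per-range histograms
```

Because each worker reads its own entry range, nothing large has to be pickled and sent between processes; only the small partial results come back.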

Jim Pivarski
  • Thanks for the reply! For a beginner, what would you suggest looking into first? – Jailbone May 15 '22 at 15:42
  • Coffea. Not only does it directly address the problem of parallel processing analysis workflows, but also there are a lot of HEP people involved, as developers and users (and both), and they can help you get started. Something like Dask is more generic, and so the help you could get from the Dask community wouldn't be HEP-specific. Try also the Coffea Users' Meeting: https://indico.cern.ch/category/11674/ – Jim Pivarski May 16 '22 at 16:06
  • Alright, thank you! Is there a forum-like place where one can ask questions? – Jailbone May 16 '22 at 23:09
  • https://gitter.im/coffea-hep/community – Jim Pivarski May 17 '22 at 00:03
  • Is this really the place to ask newbie questions? I don't want to clutter the chat. Would you also be willing to help me if I start a topic here on Stack Overflow and try to be precise and specific? – Jailbone May 17 '22 at 10:25
  • Three years ago, I tried to encourage the Python-HEP community to use StackOverflow, but that largely failed. Slack (by invite), Mattermost (need CERN account), and Gitter (free for everyone) are where most of the conversation takes place; it would be hard to encourage Coffea users and developers to pay attention to a StackOverflow topic. Don't worry at all about cluttering the Gitter chat! See how the last message was 113 days ago? Coffea doesn't use Gitter as much as Mattermost, but they'll get a notification if you start talking. If it's helpful, I can invite you to Slack. – Jim Pivarski May 17 '22 at 14:35