
I am reading many (say 1k) CERN ROOT files in a loop and storing some data in a nested NumPy array. The loop makes this a serial task, and each file takes quite some time to process. Since I am working on a deep learning model, I need a large enough dataset, but the reading itself is taking a very long time (reading 835 events takes about 21 minutes). Can anyone suggest whether it is possible to use multiple GPUs to read the data, so that less time is required for the reading? If so, how?

Adding some more details: I pushed the program to GitHub so that it can be seen (please let me know if posting a GitHub link is not allowed; in that case, I will post the relevant portion here):

https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py

I run the program as:

python read_n_train.py <input-file-list>

where the argument is a text file containing the list of files with their paths. I open the ROOT files in a loop in the read_data_into_list() function. But as I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed gets worse as more and more data is read.

Meanwhile, I tried the slurmpy package https://github.com/brentp/slurmpy With this, I can distribute the job over, say, N worker nodes. In that case, each reading program reads the file assigned to it and returns a corresponding list. It is just that in the end, I need to combine the lists, and I couldn't figure out a way to do this; a made-up illustration of what I mean is below.
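For illustration (this is not my actual code; the numbers are made up), each worker returns a nested list, and at the end I want a single combined list:

```python
# Made-up example of what each slurm worker returns and what I want to end up with.
list_from_worker_1 = [[0, 2, 3], [1, 5, 7]]   # nested per-event lists
list_from_worker_2 = [[2, 0, 8]]

# Desired combined result: [[0, 2, 3], [1, 5, 7], [2, 0, 8]]
combined = list_from_worker_1 + list_from_worker_2
```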

Any help is highly appreciated.

Regards, Kolahal

  • Please show the code you are currently using, and if possible an example data file. Also, ROOT reading can be fairly efficient; chances are that in 20 minutes you could read much more than you can hold in memory. – Keldorn Feb 17 '19 at 15:31
  • Hello Keldorn, I added some details. I would appreciate it if you could suggest something. – kolahalb Feb 18 '19 at 03:53

1 Answer


You're looping over all the events sequentially from Python; that's probably the bottleneck.

You can look into root_numpy to load the data you need from the ROOT files into NumPy arrays:

root_numpy is a Python extension module that provides an efficient interface between ROOT and NumPy. root_numpy’s internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations.
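Something like this (an untested sketch; the tree name 'tree' and the branch names are placeholders for whatever your files actually contain):

```python
# Untested sketch: read selected branches from many ROOT files at once with root_numpy.
import numpy as np
from root_numpy import root2array

# Read the file list the same way your script already does.
with open('input-file-list.txt') as f:
    filenames = [line.strip() for line in f if line.strip()]

# root2array accepts a list of files and returns one structured NumPy array,
# so the per-event loop happens in compiled C++ rather than in Python.
data = root2array(filenames, treename='tree', branches=['x', 'y', 'energy'])

x = data['x']  # each branch is then a plain 1-D NumPy array
```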

I'm also currently looking at root_pandas, which seems similar.
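If you prefer DataFrames, the equivalent would look roughly like this (again untested, same placeholder names):

```python
# Untested sketch with root_pandas: same idea, but the result is a pandas DataFrame.
from root_pandas import read_root

df = read_root(filenames, key='tree', columns=['x', 'y', 'energy'])
```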

While this solution does not directly answer the request for parallelization, it may make parallelization unnecessary. And if it is still too slow, it can still be run in parallel using slurm or something else.
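For example, file-level parallelism with the standard library could look roughly like this (untested; read_one_file and the tree/branch names are just placeholders):

```python
# Untested sketch: read each file in a separate process and concatenate the results.
import numpy as np
from multiprocessing import Pool
from root_numpy import root2array

def read_one_file(path):
    # Placeholder: read whichever branches you actually need from each file.
    return root2array(path, treename='tree', branches=['x', 'y', 'energy'])

if __name__ == '__main__':
    with open('input-file-list.txt') as f:
        filenames = [line.strip() for line in f if line.strip()]
    with Pool(processes=8) as pool:
        per_file = pool.map(read_one_file, filenames)
    dataset = np.concatenate(per_file)  # merge the per-file arrays into one
```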

Keldorn
  • Hello Keldorn, thanks for the suggestion. I have not used root_numpy before, and I will take a look at it. By the way, can you tell me if it is possible to concatenate the bash prompt outputs from different slurm tasks? – kolahalb Feb 18 '19 at 05:17
  • I don't understand the question. Just concatenate files? `cat a.log b.log > c.log`? – Keldorn Feb 18 '19 at 15:14
  • Hello Keldorn, well, actually I meant whether it is possible to combine the lists or the NumPy arrays. In fact, the reading function returns lists of lists of lists. In slurmpy, I can run slurm.run("bash command"), e.g. slurm.run("output = python read_file.py (input-file-list); echo $output"), which produces the lists in separate log files, like [0,2,3... 2,0,8], say. Now, if I cat these files, I get a file where the lists are not combined into one; they are just written one after another. It would be best if I could read these outputs in my program itself as lists and combine them into one. – kolahalb Feb 18 '19 at 18:47
  • In fact, after writing to you, I am wondering whether it should be possible to directly read the output into a list in my program, over all the logs. Looping over them may be unavoidable, though. – kolahalb Feb 18 '19 at 18:53
  • Hello Keldorn, thanks for pointing out that it could easily be done from the log files. I personally did not want to work with text files and was thinking of using Python lists directly, or using h5 files. Once the job writes the output files, in the same Python session I can read each log file and parse its contents into a Python list with the ast.literal_eval() function (roughly as in the sketch after this thread). I don't know if this approach will pay off in terms of time, but in principle it is doable. I will investigate whether one can save more time using root_numpy. – kolahalb Feb 18 '19 at 21:00
  • Please focus on one question at a time, maybe [take the tour](https://stackoverflow.com/tour) and read about [asking questions on Stack Overflow](https://stackoverflow.com/help/asking). The idea is to make questions helpful for other, not just personal help for individuals. – Keldorn Feb 18 '19 at 21:34
  • Hello Keldorn, oh yes. Thank you for the suggestion...I will keep in mind. – kolahalb Feb 18 '19 at 21:58
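
For reference, the log-parsing idea mentioned in the comments above could look roughly like this (untested; the log file names are made up):

```python
# Untested sketch: parse the list printed in each slurm log and merge them into one list.
import ast

combined = []
for log_name in ['worker_0.log', 'worker_1.log']:  # made-up names, one log per task
    with open(log_name) as f:
        combined.extend(ast.literal_eval(f.read().strip()))
```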