I am reading many (say 1k) CERN ROOT files in a loop and storing some data into a nested NumPy array. The loop makes this a serial task, and each file takes quite some time to process. Since I am working on a deep learning model, I must create a large enough dataset, but the reading itself is taking a very long time (reading 835 events takes about 21 minutes). Can anyone please suggest whether it is possible to use multiple GPUs to read the data, so that less time is required? If so, how?
Adding some more details: I pushed the program to GitHub so that it can be seen (please let me know if posting a GitHub link is not allowed; in that case, I will post the relevant portion here):
https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py
I run the program as:
python read_n_train.py <input-file-list>
where the argument is a text file containing the list of files with their paths. I open the ROOT files in a loop inside the read_data_into_list() function. As I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed gets worse as more and more data is read. Schematically, the function does something like the sketch below (I am using uproot here purely for illustration; the tree and branch names are placeholders, and the actual code in the repository may differ):
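```python
import numpy as np
import uproot  # illustration only; the real code may use PyROOT instead

def read_data_into_list(file_list):
    data = []
    for path in file_list:                      # serial loop over ~1k files
        with uproot.open(path) as f:
            tree = f["tree_name"]               # placeholder tree name
            # pull the branches of interest into NumPy arrays
            arrays = tree.arrays(["branch_a", "branch_b"], library="np")
            data.append([arrays["branch_a"], arrays["branch_b"]])
    return data                                 # nested list, later converted to a NumPy array
```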
Meanwhile, I tried the slurmpy package (https://github.com/brentp/slurmpy). With it, I can distribute the job across, say, N worker nodes. In that case, each reading program reads the files assigned to it and returns a corresponding list. In the end, I just need to combine the lists, but I couldn't figure out a way to do this. To make the merging problem concrete, what I had in mind is roughly the following: each worker job saves its partial list to its own file, and a final step loads and concatenates them. The file-naming scheme and the concatenation along the event axis are only my assumptions about how this could work:
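```python
import glob
import numpy as np

# Each worker job would save its partial result with something like
#   np.save("partial_%03d.npy" % job_id, np.asarray(partial_list, dtype=object))
# (the file names and dtype are assumptions for this sketch, not my actual code).

def merge_partial_results(pattern="partial_*.npy"):
    parts = []
    for path in sorted(glob.glob(pattern)):     # gather the outputs of all N worker jobs
        parts.append(np.load(path, allow_pickle=True))
    # concatenate along the event axis to obtain one combined dataset
    return np.concatenate(parts, axis=0)

combined = merge_partial_results()
```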
Any help is highly appreciated.
Regards, Kolahal