I asked a question here about how to read a very large file into Python, and I got a response based on zip_longest.
The problem is that this solution is extremely slow: Keras' model.predict took over 2 hours to process 200,000 lines of a file that normally takes under 3 minutes when the file is loaded directly into memory, and I want to be able to process files 5x this size.
I've since found the chunking functionality in pandas (the chunksize argument to read_csv), but I don't understand how to load a chunk of the file, reshape the data, and feed it to the model using these methods, and I also don't know whether this will be the fastest way to read and use the data in a very large file.
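To make the question concrete, here is a rough sketch of the kind of loop I'm imagining. The file name, the input shape, and the tiny stand-in model are all placeholders (my real model and data are different); the point is just the pattern of reading a chunk, reshaping it, and predicting on it:

```python
import numpy as np
import pandas as pd
from tensorflow import keras

TIMESTEPS, FEATURES = 10, 8   # placeholder input shape the model expects
CHUNK_ROWS = 10_000           # rows of the file to load at a time

# Stand-in for my real trained model, just so the snippet runs end to end
model = keras.Sequential([
    keras.layers.Input(shape=(TIMESTEPS, FEATURES)),
    keras.layers.Flatten(),
    keras.layers.Dense(1),
])

predictions = []
# read_csv with chunksize returns an iterator of DataFrames instead of
# loading the whole file into memory at once
for chunk in pd.read_csv("big_file.csv", header=None, chunksize=CHUNK_ROWS):
    x = chunk.to_numpy(dtype="float32")       # one chunk -> numpy array
    x = x.reshape((-1, TIMESTEPS, FEATURES))  # each row is one flattened sample
    predictions.append(model.predict(x))      # predict on this chunk only

all_predictions = np.concatenate(predictions)
```

This keeps only one chunk in memory at a time, but I have no idea whether it avoids the slowdown I saw with the zip_longest-based generator, or whether there is a faster way to stream the file into model.predict.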
Any fast solutions to this problem are welcome.