I have a lot of pickle files. Currently I read them in a loop, but it takes a long time. I would like to speed it up but don't have any idea how to do that.
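To make the setup concrete, the loop looks roughly like this (function and variable names are just placeholders):

```python
import pickle

def load_all(paths):
    # Read each pickle file sequentially -- this is the slow part
    results = []
    for path in paths:
        with open(path, "rb") as f:
            results.append(pickle.load(f))
    return results
```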
Multiprocessing wouldn't work because, in order to transfer data from a child process to the main process, the data needs to be serialized (pickled) and deserialized again.
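A minimal sketch of the multiprocessing variant I mean (the `load` helper is hypothetical) -- each worker unpickles a file, but the result is then re-pickled to cross the process boundary and unpickled once more in the parent:

```python
import pickle
from concurrent.futures import ProcessPoolExecutor

def load(path):
    # Deserialized inside the worker process...
    with open(path, "rb") as f:
        return pickle.load(f)

def load_all_mp(paths):
    # ...but ex.map re-pickles every result to send it
    # back to the main process, which is the overhead
    # I'm worried about.
    with ProcessPoolExecutor() as ex:
        return list(ex.map(load, paths))
```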
Using threading wouldn't help either because of the GIL.
I think the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without the GIL). Is there something like this around?
UPDATE: Answering your questions:
- Files are partial products of data processing for the purpose of ML
- There are `pandas.Series` objects, but the dtype is not known upfront
- I want to have many files because we want to pick any subset easily
- I want to have many smaller files instead of one big file because deserialization of one big file takes more memory (at some point in time we have both the serialized string and the deserialized objects in memory)
- The size of the files can vary a lot
- I use Python 3.7, so the C implementation of pickle (formerly `cPickle`) is used automatically
- Using pickle is very flexible because I don't have to worry about the underlying types - I can save anything