17

I have a lot of pickle files. Currently I read them in a loop, but it takes a lot of time. I would like to speed it up, but I don't have any idea how to do that.
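For concreteness, the current loading pattern is essentially the following (a minimal sketch; the directory name and glob pattern are placeholders, not my exact code):

import glob
import pickle

# minimal sketch of the sequential loop (paths are placeholders)
loaded = []
for path in glob.glob("data/*.pickle"):
    with open(path, "rb") as f:
        loaded.append(pickle.load(f))  # typically one pandas.Series per file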

Multiprocessing wouldn't work, because in order to transfer data from a child process to the main process the data needs to be serialized (pickled) and deserialized again.

Using threading wouldn't help either, because of the GIL.

I think the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without the GIL). Is there something like this around?

UPDATE: Answering your questions:

  • The files are partial products of data processing for ML purposes
  • They contain pandas.Series objects, but the dtype is not known upfront
  • I want to have many files so that we can easily pick any subset
  • I want many smaller files instead of one big file, because deserializing one big file takes more memory (at some point in time we hold both the serialized string and the deserialized objects)
  • The size of the files can vary a lot
  • I use Python 3.7, so I believe it's cPickle under the hood anyway
  • Using pickle is very flexible because I don't have to worry about the underlying types - I can save anything
user2146414
  • Does this help? https://stackoverflow.com/a/50479955/3288092 – BernardL Feb 24 '21 at 10:01
  • @BernardL Not really. I read data from one disk and don't see any gain using threads. I think the decompression and deserialization run under the GIL, and the I/O has a lower impact on the total time. – user2146414 Feb 24 '21 at 11:23
  • I think this process is more I/O bound than processing bound. – SaGaR Feb 27 '21 at 09:02
  • If the bottleneck primarily involves creating Python objects from the pickle data, I can't think of anything you can do without rearchitecting your code in some way or switching to a version of Python that does not impose the limitations of the GIL. – CryptoFool Feb 28 '21 at 18:30
  • Can you tell if all these pickled files represent the same Python object type or different types? Also, can you share more details, like the average size of a pickled file and what and how many objects are inside each file? How many files are there in total to unpickle? Also, do you have an HDD or an SSD? If you have an HDD, have you tried storing all the pickled files in one joint file, like a .tar archive? That would improve reading speed greatly. – Arty Feb 28 '21 at 18:33
  • What's in the pickle files? I mean, what kind of objects? Have you tried `cpickle`? – Mark Setchell Feb 28 '21 at 18:33
  • @Arty - I think we're assuming that I/O isn't the issue here. If it were, there would be fairly trivial multithreading solutions that would help (see the link in the first comment) and wouldn't require reworking the input data, which, for all we know, might be hard or impossible for the OP to do. - Although, you could still have something here if the pickle files are fairly small and large in number; then part of the speed problem might be processing all of the files rather than the I/O itself. So maybe the OP can tell us more about the pickle files. – CryptoFool Feb 28 '21 at 18:40
  • I think the idea of `cpickle` is promising. It addresses speeding up the non-I/O portion of the operation, which is what you're trying to figure out how to do, right? – CryptoFool Feb 28 '21 at 18:46
  • @CryptoFool It happens very often that the real problem is misdiagnosed. Maybe the OP just wrote a simple loop and it was slow, without debugging where the bottleneck actually is. So I don't think we can assume anything here; the bottleneck in a real system can be anywhere. Of course the OP's question looks like they want to optimize the script's performance, but most likely they just want the task as a whole to finish faster on their system, not only to speed up the script. So we should check all possible causes without assumptions. – Arty Feb 28 '21 at 18:46
  • @Arty - I don't disagree, if the act of converting all of their files to a single TAR isn't problematic. I would think, however, that in many workflows it would be. – CryptoFool Feb 28 '21 at 18:49
  • @CryptoFool - unless I'm missing something, cpickle is now part of pickle. See answer and comments here: https://stackoverflow.com/questions/37132899/installing-cpickle-with-python-3-5#37138791 – hrokr Feb 28 '21 at 22:06
  • @hrokr - well, that's bad news for the OP, since it means that won't help them :(. Thanks for pointing that out. I've never used `pickle` for serious work, but if I ever do, now I'll know not to think about explicitly installing and using `cPickle`. Thanks much for the info! – CryptoFool Feb 28 '21 at 22:08
  • @CryptoFool - Maybe not. The metrics for quickle and pyrobuf are shown against the current version of pickle. So an increase is possible, but rather than the low-hanging fruit of a 10x speedup, it seems time can only be cut by another 4x or so. Depending on the data source, more should be possible. I know MessagePack works quite well with JSON. – hrokr Feb 28 '21 at 23:33
  • @hrokr - I concur, if the OP isn't stuck with pickle output as their input format. – CryptoFool Mar 01 '21 at 00:40
  • @user2146414 Can you tell if you can use multiprocessing at all for your task? If you send the unpickled data back to the main process, it of course has to be re-pickled one more time and will not give any improvement. But what about doing all the later work right in the worker processes, without gathering the data into the main process? For example, if your pickled data contains images and you want to apply some convolutional filter to them, then inside each process (when using multiprocessing) you can unpickle the data and apply the filter right in the same process where it was unpickled. Can you do such a thing? – Arty Mar 01 '21 at 04:25
  • @user2146414 Can you be specific about what type of data you are trying to pickle? Is it a large chunk of data in every file, or small chunks of data in each file? – SaGaR Mar 01 '21 at 06:34

5 Answers

5

I agree with what has been noted in the comments, namely that due to the constraints of Python itself (chiefly the GIL, as you noted) there may simply be no faster way of loading the information than what you are doing now. Or, if there is a way, it may be both highly technical and, in the end, give you only a modest increase in speed.

That said, depending on the datatypes you have, it may be faster to use quickle or pyrobuf.
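For instance, a round trip with quickle looks roughly like this (a sketch assuming quickle is installed; it handles only a restricted set of built-in types, so pandas objects would first have to be converted to something it supports):

import quickle

# sketch data only: quickle serializes a limited set of built-in types
record = {"values": [1.0, 2.5, 3.75], "label": "partial_product"}

blob = quickle.dumps(record)    # serialize to bytes
restored = quickle.loads(blob)  # deserialize back to Python objects
assert restored == record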

hrokr
  • ...or `cpickle`, as @MarkSetchell suggests. If I'm reading correctly, `cpickle` would be compatible with the existing data. - It seems that `pyrobuf` requires `Cython`, which would eliminate the GIL and therefore completely change the nature of the problem. – CryptoFool Feb 28 '21 at 18:50
  • @CryptoFool - that's worth adding, but I haven't used it, for a different reason: pickle (and apparently cpickle) automatically runs code when loading. That is something that makes me cringe every time. If it's just my own stuff, sure. But if I'm sending or receiving something, that's a risk I'm not keen on taking. – hrokr Feb 28 '21 at 18:59
  • @MarkSetchell - I was having a problem finding a repo for cPickle. Apparently, pickle now uses cPickle internally (https://stackoverflow.com/questions/37132899/installing-cpickle-with-python-3-5#37138791) and has been doing so for some time now. So that doesn't appear to be of any benefit. Does that match with your experience? – hrokr Feb 28 '21 at 19:57
4

I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?

In short: no. pickle is apparently good enough for enough people that there are no major alternative implementations fully compatible with the pickle protocol. As of Python 3, cPickle was merged into pickle, and neither releases the GIL anyway, which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).

If your data can be re-structured into a simpler format like CSV, or a binary format like numpy's npy, there will be less CPU overhead when reading it. Pickle is built for flexibility first rather than speed or compactness. One possible exception to the rule that more complexity means less speed is HDF5 via h5py, which can be fairly complex, yet I have used it to max out the bandwidth of a SATA SSD.
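As a rough illustration, the numeric payload of a pandas.Series could be dumped to a raw .npy file, which skips the pickle machinery entirely for the bulk of the data (a sketch; the file name is a placeholder, and a default RangeIndex is assumed since npy stores only the values):

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))

np.save("part_0001.npy", s.to_numpy())          # plain binary array on disk
restored = pd.Series(np.load("part_0001.npy"))  # read back without pickle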

Finally, you mention you have many, many pickle files, and that itself probably causes no small amount of overhead: each time you open a new file, there is some operating-system overhead involved. Conveniently, you can combine pickle files by simply appending them to one another, then call Unpickler.load() until you reach the end of the file. Here's a quick example of combining two pickle files using shutil:

import pickle, shutil, os

#some dummy data
d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}

#create two pickles
with open('test1.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d1)
with open('test2.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d2)
    
#combine list of pickle files
with open('test3.pickle', 'wb') as dst:
    for pickle_file in ['test1.pickle', 'test2.pickle']:
        with open(pickle_file, 'rb') as src:
            shutil.copyfileobj(src, dst)
            
#unpack the data
with open('test3.pickle', 'rb') as f:
    p = pickle.Unpickler(f)
    while True:
        try:
            print(p.load())
        except EOFError:
            break
        
#cleanup
os.remove('test1.pickle')
os.remove('test2.pickle')
os.remove('test3.pickle')
Aaron
  • That's not what the metrics for competing projects show. – hrokr Mar 03 '21 at 02:54
  • @hrokr if there are any major projects that are **fully** compatible with the pickle protocol that are faster than `pickle` I am not aware of them. `quickle` and `pyrobuf` would fall under the second paragraph encouraging the transition to another format that has a faster, more efficient deserialization. – Aaron Mar 03 '21 at 14:29
  • If you look at the edits to the question, you'll note the requirement was added five days *after* the original question was asked. And while I understand the OP might want something that can handle any data type, most things are optimized for speed in one area or another -- which is why several people have asked what is in the files. – hrokr Mar 03 '21 at 20:46
  • @Aaron Thanks for pointing out the lack of `Py_BEGIN_ALLOW_THREADS` that indicates that trying to create C module using code from `_pickle.c` won't help. – user2146414 Mar 07 '21 at 07:11
2

I think you should try to use mmap (memory-mapped files), which is similar to open() but way faster.

Note: if each of your files is big, use mmap; if the files are small, use the regular methods.

I have written a sample that you can try.

import mmap
import pickle
from time import perf_counter as pf

def load_files(filelist):
    start = pf()  # for rough time calculations
    for filename in filelist:
        # file_obj is only used for its file descriptor (fileno());
        # the actual bytes are read through the memory map
        with open(filename, mode="r", encoding="utf8") as file_obj:
            with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_file_obj:
                data = pickle.load(mmap_file_obj)
                print(data)
    print(f'Operation took {pf()-start} sec(s)')

Here mmap.ACCESS_READ maps the file read-only. The file_obj returned by open is only used to get the file descriptor, which is then used to open a stream to the file via mmap as a memory-mapped file. We don't do anything with file_obj operation-wise; we just need its fileno() method to get the underlying file descriptor (see the os.open documentation quoted below for what a file descriptor refers to). Also, we are not closing file_obj before mmap_file_obj; take a proper look, we close the mmap block first, as you said in your comment.

os.open(file, flags[, mode])
Open the file file and set various flags according to flags and possibly its mode according to mode.
The default mode is 0777 (octal), and the current umask value is first masked out.
Return the file descriptor for the newly opened file.

Give it a try and see how much impact it has on your operation. You can read more about mmap here, and about file descriptors here.
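For example, you could call it on your own file list like this (the directory and glob pattern are just placeholders):

import glob

load_files(glob.glob("pickles/*.pickle"))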

SaGaR
  • Don't you (1) need to open the pickled file in binary mode? and (2) you are clobbering `file_obj` returned by the call to `open` with your call to `mmap.mmap` and that does not seem correct. – Booboo Feb 28 '21 at 15:39
  • `mmap.ACCESS_READ` is the mode to open the file in binary. The `file_obj` returned by `open` is just used to get the `file descriptor`, which is used to open the stream to the file via `mmap`. @Booboo – SaGaR Feb 28 '21 at 17:48
  • What makes you think memory mapping the file makes reading it faster? This is true if you are going to make many small reads on the file, or are going to perform random access on the file. If you are instead going to read the file in bulk, how is it faster to do so through a memory map than directly? There is no reason that it should be any faster. – CryptoFool Feb 28 '21 at 18:15
  • @SaGaR Those were questions. As far as (1) goes, I have tried it with binary mode and that works. As far as (2) goes, I have not tried it, but the link you point to certainly uses a different variable for the call to `mmap.mmap`, and the context manager for `open` will attempt to call close on `file_obj`, which may not fail because it might be valid for the memory-mapped file, but you might still be leaving the original file handle open. I don't know -- it just looks questionable. If I knew for sure, I would have downvoted you instead of asking. – Booboo Feb 28 '21 at 18:34
  • Hey @Booboo, I have updated the answer. Please take a look again and tell me if I explained it better. – SaGaR Mar 01 '21 at 01:51
  • Also, about your first comment about `file_obj` being used in both context managers: that is not a problem, because we are working with mmap's context manager, which has a different scope than open's context manager. @Booboo – SaGaR Mar 01 '21 at 02:46
  • @cryptofool I suggested memory-mapped files because if the files are big they can help, but if the files are small it will only add overhead. I will update the answer. – SaGaR Mar 01 '21 at 06:35
  • @SaGaR - My understanding of how things work seems to be just the opposite of what you're saying. Why does reading a whole file into a memory map happen any more quickly than reading it into Python's address space before it is decoded? There's no reason I know of that memory-mapping large or small files should offer any advantage; the file I/O is the same in that case. The advantage of memory-mapped files comes when the code isn't going to read the file in bulk, but rather access it in small chunks or by seeking around in it. – CryptoFool Mar 01 '21 at 08:54
0

You can try multiprocessing:

import os, pickle
from multiprocessing import Pool

pickle_list = os.listdir("pickles")

# NOTE: a plain dict is not shared between worker processes; for the
# "wait for the previous file" logic to actually work across processes
# you would need e.g. a multiprocessing.Manager().dict()
output_dict = dict.fromkeys(pickle_list, '')

def pickle_process_func(picklename):
    with open("pickles/" + picklename, 'rb') as file:
        dapickle = pickle.load(file)

    # if you need the previous file's output, wait for it
    while not output_dict[pickle_list[pickle_list.index(picklename) - 1]]:
        continue

    # then do something with the data
    print("loaded")
    output_dict[picklename] = custom_func_i_dunno(dapickle)  # your own processing here


if __name__ == "__main__":
    with Pool(processes=10) as pool:
        pool.map(pickle_process_func, pickle_list)
  • This was addressed in the question. `multiprocessing.Pool.map` uses a single `Queue` (which serializes and deserializes data using `pickle`) to receive results from the child processes, so the speed would bottleneck there instead. You are still limited by the speed of a single core unpickling a stream of data. – Aaron Mar 03 '21 at 21:01
  • How about using shared memory for passing the results? – Cyrille Pontvieux Mar 05 '21 at 15:44
  • @CyrillePontvieux `multiprocessing.shared_memory` only exposes a binary, bytes-like array of memory, and sharing arbitrary Python objects is unsupported. It's great for things like numpy arrays or pandas Series objects where the underlying data is just a binary array, but structured data is much more difficult. – Aaron Mar 05 '21 at 22:22
  • @Aaron how about converting pickles to sql? – Rifat Alptekin Çetin Mar 06 '21 at 16:07
  • @RifatAlptekinÇetin would have to benchmark for speed... seems like the OP really wants pickle, however... – Aaron Mar 06 '21 at 17:54
0

Consider using HDF5 via h5py instead of pickle. Its performance with numerical data in pandas and numpy data structures is generally much better than pickle's, and it supports most common data types as well as compression.
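For example, with pandas' built-in HDF5 support (a sketch; this requires the PyTables package, and the file and key names are placeholders):

import numpy as np
import pandas as pd

s1 = pd.Series(np.random.rand(100_000))
s2 = pd.Series(np.random.randint(0, 10, 100_000))

# write several series into one compressed HDF5 store
with pd.HDFStore("features.h5", mode="w", complib="blosc", complevel=5) as store:
    store.put("series/s1", s1)
    store.put("series/s2", s2)

# later, load only the subset you need
with pd.HDFStore("features.h5", mode="r") as store:
    subset = store["series/s1"]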

Chris_Rands