I am running code that creates large objects containing multiple user-defined classes, which I must then serialize for later use. From what I can tell, only pickling is versatile enough for my requirements. I've been using cPickle to store them, but the files it generates are approximately 40 GB in size, from code that runs in 500 MB of memory. Serialization speed isn't an issue, but the size of the output is. Are there any tips or alternative approaches I can use to make the pickles smaller?

ddn
  • What pickle protocol are you using? – user2357112 Aug 27 '13 at 20:30
  • Protocol version 0. Would 2 make a substantial difference? – ddn Aug 27 '13 at 20:44
  • It should make some difference. I'm not sure how much, though. – user2357112 Aug 27 '13 at 20:45
  • I think protocol 2 would make a substantial difference, it's worth trying. Combine it with compressing with `gzip` or `bzip2` to make an even larger difference. I'd be interested in the numerical results. – pts Aug 27 '13 at 22:58
  • It came down from 40 to 12.2! The docs I read didn't seem like it'd have such an impact, so I'm quite surprised. Zipping was insufficient before, but it will work now. – ddn Aug 30 '13 at 17:33
  • This post and answers are great. – O.rka Feb 08 '18 at 17:18

3 Answers

You can combine your cPickle dump call with gzip compression:

import cPickle
import gzip

def save_zipped_pickle(obj, filename, protocol=-1):
    with gzip.open(filename, 'wb') as f:
        cPickle.dump(obj, f, protocol)

And to reload a gzipped pickle:

def load_zipped_pickle(filename):
    with gzip.open(filename, 'rb') as f:
        loaded_object = cPickle.load(f)
        return loaded_object
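
The same pattern can be sketched for Python 3, where cPickle no longer exists as a separate module and `pickle` uses the C implementation automatically. This is a minimal sketch, not a drop-in replacement: the sample data and the temporary file path are hypothetical and only there to make the round trip runnable.

```python
import gzip
import os
import pickle
import tempfile


def save_zipped_pickle(obj, filename, protocol=pickle.HIGHEST_PROTOCOL):
    """Pickle obj and write it through a gzip stream."""
    with gzip.open(filename, "wb") as f:
        pickle.dump(obj, f, protocol)


def load_zipped_pickle(filename):
    """Read a gzip-compressed pickle back into an object."""
    with gzip.open(filename, "rb") as f:
        return pickle.load(f)


# Hypothetical sample data and path, for illustration only
sample = {"values": list(range(1000)), "label": "example"}
path = os.path.join(tempfile.mkdtemp(), "sample.pkl.gz")

save_zipped_pickle(sample, path)
restored = load_zipped_pickle(path)
print(restored == sample)
```

Because the pickle is streamed through the gzip file object, the uncompressed pickle never has to be held in memory as a whole.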
John Lyon

If you must use pickle and no other method of serialization works for you, you can always pipe the pickle through bzip2. The only drawback is that bzip2 is somewhat slow. gzip should be faster, but the resulting file is almost 2x bigger:

In [1]: class Test(object):
            def __init__(self):
                self.x = 3841984789317471348934788731984731749374
                self.y = 'kdjsaflkjda;sjfkdjsf;klsdjakfjdafjdskfl;adsjfl;dasjf;ljfdlf'
        l = [Test() for i in range(1000000)]

In [2]: import cPickle as pickle          
        with open('test.pickle', 'wb') as f:
            pickle.dump(l, f)
        !ls -lh test.pickle
-rw-r--r--  1 viktor  staff    88M Aug 27 22:45 test.pickle

In [3]: import bz2
        import cPickle as pickle
        with bz2.BZ2File('test.pbz2', 'w') as f:
            pickle.dump(l, f)
        !ls -lh test.pbz2
-rw-r--r--  1 viktor  staff   2.3M Aug 27 22:47 test.pbz2

In [4]: import gzip
        import cPickle as pickle
        with gzip.GzipFile('test.pgz', 'w') as f:
            pickle.dump(l, f)
        !ls -lh test.pgz
-rw-r--r--  1 viktor  staff   4.8M Aug 27 22:51 test.pgz

So the bzip2 file is almost 40x smaller than the raw pickle, and the gzip file is almost 20x smaller. gzip is also fairly close in speed to raw cPickle, as the timings show:

cPickle : best of 3: 18.9 s per loop
bzip2   : best of 3: 54.6 s per loop
gzip    : best of 3: 24.4 s per loop
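
To repeat this kind of comparison on your own data, including lzma (which is in the Python 3 standard library), something like the following sketch may help. It assumes Python 3, compresses in memory instead of writing files, and uses a smaller list of plain dicts as a stand-in for the Test instances above, so the absolute numbers will not match the ones shown.

```python
import bz2
import gzip
import lzma
import pickle

# Stand-in data: plain dicts instead of Test instances, and fewer of them
records = [
    {
        "x": 3841984789317471348934788731984731749374,
        "y": "kdjsaflkjda;sjfkdjsf;klsdjakfjdafjdskfl;adsjfl;dasjf;ljfdlf",
    }
    for _ in range(10000)
]

# Pickle once, then compress the same bytes with each algorithm
raw = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
compressed = {
    "gzip": gzip.compress(raw),
    "bz2": bz2.compress(raw),
    "lzma": lzma.compress(raw),
}

print("raw   %9d bytes" % len(raw))
for name, blob in compressed.items():
    print("%-5s %9d bytes" % (name, len(blob)))

# Round trip through one of the compressors to confirm nothing is lost
restored = pickle.loads(lzma.decompress(compressed["lzma"]))
```

Which compressor wins on size and speed depends heavily on the data, so it is worth measuring with a representative object before committing to one.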
Viktor Kerkez
  • You've not taken lzma into account, which I find is a very good algorithm. When I used lzma to compress a pickled list of 200000 random numbers, it beat gzip and bzip2 (at least sizewise, I haven't checked the speed) – Paolo Celati Jun 14 '16 at 20:32
  • @Viktor Is there a faster serialization than cpickle? (You write "no other method works for you") – dreamflasher Feb 26 '17 at 08:37
  • @DreamFlasher There are other simpler serialization modules like msgpack, json... But they don't serialize complex python objects, just the basic types. – Viktor Kerkez Feb 28 '17 at 10:52
  • @ViktorKerkez I'm aware of these, was looking for the one with the fastest loading speed, and in that respect pickle is the fastest. But you are right for the question at hand these would have been alternatives (though I don't think they would be smaller). – dreamflasher Feb 28 '17 at 11:48
  • Why not 'wb' on the compressed files? – wordsforthewise Jul 29 '18 at 21:20
  • Is the `.pbz2` extension a standard convention? – Dr_Zaszuś May 04 '20 at 19:46

You might want to use a more efficient pickling protocol.

As of now, there are three pickle protocols:

  • Protocol version 0 is the original ASCII protocol and is backwards compatible with earlier versions of Python.
  • Protocol version 1 is the old binary format which is also compatible with earlier versions of Python.
  • Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes.

Furthermore, the default is protocol 0, the least efficient one:

If a protocol is not specified, protocol 0 is used. If protocol is specified as a negative value or HIGHEST_PROTOCOL, the highest protocol version available will be used.

Let's check the size difference between the latest protocol, currently protocol 2 (the most efficient one), and protocol 0 (the default) for an arbitrary example. Note that I use protocol=-1 to make sure we always get the latest protocol, and that I import cPickle to make sure we use the faster C implementation:

import numpy
from sys import getsizeof
import cPickle as pickle

# Create list of 10 arrays of size 100x100
a = [numpy.random.random((100, 100)) for _ in xrange(10)]

# Pickle to a string in two ways
str_old = pickle.dumps(a, protocol=0)
str_new = pickle.dumps(a, protocol=-1)

# Measure size of strings (getsizeof adds a small fixed per-object overhead to the byte count)
size_old = getsizeof(str_old)
size_new = getsizeof(str_new)

# Print size (in kilobytes) using old, using new, and the ratio
print size_old / 1024.0, size_new / 1024.0, size_old / float(size_new)

The printout I get is:

2172.27246094 781.703125 2.77889698975

This indicates that pickling with the old protocol used 2172 KB, pickling with the new protocol used 782 KB, and the difference is a factor of 2.8. Note that this factor is specific to this example; your results may vary depending on the object you are pickling.
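
For readers on Python 3, a comparable check can be sketched with only the standard library. This is an illustration under assumptions, not a rerun of the measurement above: it uses nested lists of random floats instead of numpy arrays, and `len()` on the pickled bytes instead of `getsizeof`. Python 3 defaults to a binary protocol, so protocol 0 has to be requested explicitly there.

```python
import pickle
import random

# Stand-in data: 100 lists of 100 random floats (not the numpy arrays above)
random.seed(0)
data = [[random.random() for _ in range(100)] for _ in range(100)]

# Pickle with the ASCII protocol 0 and with the highest available protocol
size_old = len(pickle.dumps(data, protocol=0))
size_new = len(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))

# Show which protocols this interpreter uses by default and at most
print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)

# Sizes in bytes, and the ratio between them
print(size_old, size_new, size_old / size_new)
```

The ratio printed here applies to this float-heavy stand-in only; objects built from user-defined classes or long strings can behave quite differently.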

Moot