I have a series of directories, each about 38 MB on disk, that I need to pickle on a Python 3.6 / Windows 10 system. When I ran the following code, the resulting .pickle files were huge, ~158 MB each:
from six.moves import cPickle as pickle
with open(set_filename, 'wb') as f:
    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
Is this normal? The pickle is 4 times the size of the original data files.
I then tried bz2 with pickle, and the resulting .pkl files were much smaller, ~18 MB:
from six.moves import cPickle as pickle
import bz2
with bz2.BZ2File(set_filename, 'wb') as f:
    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
To decompress and unpickle:
with bz2.BZ2File(pickle_file, mode='r') as f:
    letter_set = pickle.load(f)
I'm happy with the improvement, but I'd take even better compression if I could find it.
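One idea I haven't tried yet is the standard-library lzma module, which I understand usually compresses smaller than bz2 at the cost of speed (an assumption on my part, not something I've measured on this data). If I read the docs right, it would be a drop-in swap:

import lzma
import pickle  # on Python 3, plain pickle already uses the fast C implementation

# preset ranges 0-9 and trades compression time for output size
with lzma.open(set_filename, 'wb', preset=6) as f:
    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)

# decompress and unpickle
with lzma.open(pickle_file, 'rb') as f:
    letter_set = pickle.load(f)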
Questions:
- I notice there is also a bz2.open() function. So far bz2.BZ2File() seems to be working, but when would I want to use bz2.open() instead? (I've sketched my current understanding after this list.)
- What is the difference between "regular" (de)compression using bz2.BZ2File(), "incremental" (de)compression using bz2.BZ2Compressor/bz2.BZ2Decompressor, and "one-shot" (de)compression using bz2.compress/bz2.decompress? I have read the documentation at https://docs.python.org/3.6/library/bz2.html, but it doesn't explain these terms or say in what cases each might be preferable. (The second sketch below shows how I currently picture the three styles.)