
I have a series of directories, each about 38 MB on disk, that I need to pickle on a Python 3.6 / Windows 10 system. When I ran the following code, the resulting .pickle files were huge, ~158 MB each:

from six.moves import cPickle as pickle
with open(set_filename, 'wb') as f:
    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)

Is this normal? The pickle is 4 times the size of the original data files.

I then tried bz2 with pickle and the resulting .pkl files were much smaller, ~18 MB:

from six.moves import cPickle as pickle
import bz2
with bz2.BZ2File(set_filename, 'wb') as f:
    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)

To decompress and unpickle:

with bz2.BZ2File(pickle_file, mode='r') as f:
    letter_set = pickle.load(f)

I'm happy with the improvement but would welcome even better compression if I could find it.
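
One thing I'm considering trying next (a sketch only, reusing the same set_filename and dataset variables as above) is swapping bz2 for the standard-library lzma module, which generally compresses tighter than bz2 at the cost of speed:

from six.moves import cPickle as pickle
import lzma

# Pickle straight into an LZMA (.xz) compressed stream.
with lzma.open(set_filename, 'wb') as f:
    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)

# Read back: decompress and unpickle in one pass.
with lzma.open(set_filename, 'rb') as f:
    letter_set = pickle.load(f)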

Questions:

  1. I notice there is also a bz2.open() function. So far bz2.BZ2File() has been working for me, but when would I want to use bz2.open() instead?
  2. What is the difference between "regular" (de)compression via the file interface (bz2.BZ2File()), "incremental" (de)compression (bz2.BZ2Compressor()/bz2.BZ2Decompressor()), and "one-shot" (de)compression (bz2.compress()/bz2.decompress())? I have read the documentation at https://docs.python.org/3.6/library/bz2.html but it doesn't explain these terms or say when each is preferable (I've sketched my rough understanding of the three interfaces after this list).
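
To make question 2 concrete, here is my rough understanding of the three interfaces in code form (data and chunks are just placeholder values, not my real dataset):

import bz2

data = b'some bytes already in memory'        # placeholder payload
chunks = [b'first piece, ', b'second piece']  # placeholder stream of pieces

# One-shot: compress/decompress a complete bytes object in a single call.
blob = bz2.compress(data)
assert bz2.decompress(blob) == data

# Incremental: feed the data piece by piece as it becomes available.
comp = bz2.BZ2Compressor()
blob = b''.join(comp.compress(c) for c in chunks) + comp.flush()
decomp = bz2.BZ2Decompressor()
assert decomp.decompress(blob) == b''.join(chunks)

# File interface: bz2.BZ2File() / bz2.open() wrap a file on disk, which is
# what pickle.dump() and pickle.load() need; bz2.open() can also open in
# text modes such as 'rt'/'wt'.
with bz2.open('example.bz2', 'wb') as f:
    f.write(data)
with bz2.open('example.bz2', 'rb') as f:
    assert f.read() == data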
Karl Baker
  • [similar question](https://stackoverflow.com/questions/29258786/what-is-the-difference-between-incremental-and-one-shot-compression) – Silfheed Mar 26 '19 at 17:46

0 Answers