The pickle module in the Python standard library serializes functions by reference (module and name), so it cannot serialize lambdas or functions with closures, and functions defined in __main__ only load where the same __main__ defines them (see here).

I need to pickle an object using some custom functions that will not be importable where they will be unpickled. There are a few other Python object serializers, including dill and cloudpickle, that are capable of doing this.

The cloudpickle documentation seems to be saying that even when you pickle using cloudpickle, you can unpickle using the standard pickle module. This is extremely attractive, because I cannot even install packages in the environment where I need to unpickle.

Indeed, the example in the documentation does basically the following:

Pickle:

>>> import cloudpickle
>>> squared = lambda x: x ** 2
>>> with open('pickled_file', 'wb') as f:  # pickles are bytes, so open in binary mode
...     cloudpickle.dump(squared, f)

Unpickle:

>>> import pickle
>>> new_squared = pickle.load(open('pickled_file', 'rb'))
>>> new_squared(2)

But, running that second block in an environment where cloudpickle is not installed, even though it is never imported, yields the error:

"ImportError: No module named cloudpickle.cloudpickle"

Probably the most easily reproducible example would be to install cloudpickle for Python2, run the first block, and then try to load in the pickled file with the second block using Python3 (where cloudpickle was not installed).

What is going on here? Why does cloudpickle need to be installed to run the standard pickle load if it is not even called?

Sealander

1 Answer

In theory, cloudpickle should not need to be installed to load a pickled object: it could include every function needed to unpickle an object within the pickle itself. However, that's only in theory.

With a method registry (e.g. copyreg), a serializer registers the function that creates a new object of the required type and restores its saved state. For the serializer not to be required at load time, it would need to include all of the required deserialization functions within the pickled object itself (this is possible because a pickle is recursive).
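To make the registry idea concrete, here is a minimal sketch that uses only the standard library. The choice of memoryview as the type and the helper name reduce_memoryview are illustrative assumptions, not anything cloudpickle itself does. It shows that a registered reducer runs at dump time, and that the pickle then records only a reference to the reconstructor plus its arguments.

```python
import copyreg
import pickle


def reduce_memoryview(view):
    # Hypothetical reducer: return (reconstructor, args). Only a reference to
    # the reconstructor (module + name) is written to the pickle, not its code.
    return memoryview, (view.tobytes(),)


# Without this registration, pickling a memoryview raises TypeError.
copyreg.pickle(memoryview, reduce_memoryview)

data = pickle.dumps(memoryview(b"abc"), protocol=2, fix_imports=False)
restored = pickle.loads(data)

print(restored.tobytes())        # the rebuilt object's contents
print(b"memoryview" in data)     # the reconstructor is stored by name
```

Loading works here only because the reconstructor (the built-in memoryview) is importable on the loading side; a reconstructor that lives in a third-party package would have to be installed there as well.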

cloudpickle assumes cloudpickle is installed and therefore (to keep the resulting pickle smaller) does not include all of the required methods. Contrast this with numpy, where the dumps method on a numpy.array does include the reconstruct method in the pickle (you can see this because numpy.core.multiarray\n_reconstruct appears in any pickle of an array).
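One way to check what a given pickle will import at load time is to walk its opcodes with the standard pickletools module. The sketch below is an illustration under stated assumptions: referenced_globals is a hypothetical helper that only understands the GLOBAL opcode (protocols 0 to 3), Fraction is just a convenient standard-library object, and no_such_module_xyz is a made-up module name standing in for a package that is not installed.

```python
import pickle
import pickletools
from fractions import Fraction


def referenced_globals(data):
    """List the (module, name) pairs that pickle.loads will try to import.

    Hypothetical helper: it only handles the GLOBAL opcode, so the pickle
    must have been written with protocol 3 or lower.
    """
    found = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            module, name = arg.split(" ", 1)
            found.append((module, name))
    return found


data = pickle.dumps(Fraction(1, 3), protocol=2, fix_imports=False)
print(referenced_globals(data))

# Hand-written protocol-0 pickle: GLOBAL opcode "c", then "module\nname\n",
# then STOP ".". The module name is made up and is not installed anywhere.
missing = b"cno_such_module_xyz\nsome_function\n."
try:
    pickle.loads(missing)
    load_error = None
except ImportError as exc:
    load_error = exc
print(repr(load_error))
```

The second half reproduces the same kind of ImportError as in the question without cloudpickle being involved: the standard unpickler imports whatever module the pickle names, whether or not the loading script imports it.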

Mike McKerns
  • Do you know if `cloudpickle` can be made recursive to not require its installation to unpickle? Or is the answer 'it needs to be installed'. – Sealander Mar 09 '16 at 20:40
  • I'm the `dill` author, so I can only speak in theory for `cloudpickle`… however, if you wanted to do this w/o forking `cloudpickle`, I think the answer is "no", not easily at the very least -- you need to install `cloudpickle`. The work-around is that you'd need to register every one of the methods used by `cloudpickle` into `pickle` with `copyreg` and then also save the resulting method table (it's a dict) as a pickle that would be first opened (so it could be referenced from `globals`) before the target object is unpickled. – Mike McKerns Mar 09 '16 at 20:57
  • similar to this: http://stackoverflow.com/questions/27351980/how-to-add-a-custom-type-to-dills-pickleable-types – Mike McKerns Mar 09 '16 at 20:59