
Currently, the conda installation provides pyspark 2.4.0. Installing via pip allows a later version, pyspark 3.1.2, but with this version the dill library conflicts with the pickle library.

I use this setup for pyspark unit tests. If I import the dill library in the test script, or in any other test that imports dill and runs alongside the pyspark test under pytest, it breaks.

The error it gives is shown below.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 101, in dumps
    cp.dump(obj)
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 722, in save_function
    *self._dynamic_function_reduce(obj), obj=obj
  File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
    dictitems=dictitems, obj=obj
  File "/opt/conda/lib/python3.6/pickle.py", line 610, in save_reduce
    save(args)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/site-packages/dill/_dill.py", line 1146, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty

This happens in the save function in /opt/conda/lib/python3.6/pickle.py. After the persistent id and memo checks, it gets the type of obj, and if that type is the 'cell' class, it looks up a handler for it on the next line using self.dispatch.get. With pyspark 2.4.0 that lookup returns None and everything works, but with pyspark 3.1.2 it returns a handler, which forces the object through save_reduce. That fails because the cell is empty, e.g. <cell at 0x7f0729a2as66: empty>.

If we force the lookup to return None for the pyspark 3.1.2 installation, it works, but that needs to happen gracefully rather than by hardcoding.
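For context, here is a rough sketch of the dispatch lookup described above. The expected outputs in the comments are assumptions based on the traceback, since exactly which registry dill touches may vary by version:

import pickle

# Python 3.6 has no types.CellType, so obtain the 'cell' type from a closure.
def make_cell(value):
    return (lambda: value).__closure__[0]

cell_type = type(make_cell(None))

# The traceback goes through the pure-Python pickler (pickle._Pickler);
# with stock pickle there is no handler registered for cells.
print(pickle._Pickler.dispatch.get(cell_type))  # expected: None

import dill  # dill 0.3.1.1 appears to register its handlers at import time

# On an affected setup the same lookup now returns dill's save_cell,
# which is the handler that later fails on an empty cell.
print(pickle._Pickler.dispatch.get(cell_type))  # expected: dill's save_cell

# dill.extend(False) removes dill's entries again, restoring the
# "returns None" behaviour without hardcoding anything in pickle.py.
dill.extend(False)
print(pickle._Pickler.dispatch.get(cell_type))  # expected: None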

Has anyone had this issue? Any suggestions on which versions of dill, pickle, and pyspark to use together?

Here is the code that is being used:

import pytest
from pyspark.sql import SparkSession
import dill # if this line is added, test does not work with pyspark-3.1.2

# sample data and schema (not used by the test below)
simpleData = [
    ("James", "Sales", "NY", 90000, 34, 10000),
]

schema = ["A", "B", "C", "D", "E", "F"]


@pytest.fixture(scope="session")
def start_session(request):
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("Python Spark unit test")
        .getOrCreate()
    )
    yield spark
    spark.stop()


def test_simple_rdd(start_session):

    rdd = start_session.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
    assert rdd.stdev() == 2.0

This works with pyspark 2.4.0 but fails with pyspark 3.1.2 with the error shown above.

dill version - 0.3.1.1, pickle version - 4.0, python - 3.6

  • Can you post minimal code that generates the error you are seeing? Also, it's not clear what you are asking. Also, from the first line of your traceback, it seems that you are using `cloudpickle`... so do you mean you are seeing a conflict with `pickle` or with `cloudpickle`? There are known conflicts between `dill` and `cloudpickle`, and also no serializer can guarantee object transfer between multiple versions of a code... can you clarify (1) your question, and (2) your test code that produces the error? – Mike McKerns Sep 28 '21 at 13:57
  • Added sample code in the body of the question. The error originates from this line: assert rdd.stdev() == 2.0. The pytest error gives a whole lot of stack trace, which I am not posting here. – python_enthusiast Sep 28 '21 at 15:06
  • @python_enthusiast the code above is not enough for me to reproduce. Could you please check it? – M4rk Nov 22 '21 at 14:56

1 Answer


Apparently you aren't using dill except to import it. I assume you will be using it later...? As I mentioned in my comment, cloudpickle and dill do have some mild conflicts, and this appears to be what you are experiencing. Both serializers add logic to the pickle registry to tell python how to serialize different kinds of objects. So, if you use both dill and cloudpickle, there can be conflicts: the pickle registry is a dict, so the order of imports and similar details matter. The issue is similar to the one noted here: https://github.com/tensorflow/tfx/issues/2090

There are a few things you can try:

(1) Some codebases allow you to replace the serializer. So, if you are able to replace cloudpickle with dill, that may resolve the conflicts. I'm not sure this can be done with pyspark, but there is a pyspark module on serializers, so that is promising... Set PySpark Serializer in PySpark Builder
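An untested sketch of what (1) might look like: pyspark.serializers is importable and SparkContext accepts a serializer argument, but note this controls data serialization, so it may not swap out the bundled cloudpickle that serializes closures in the traceback above:

from pyspark import SparkConf, SparkContext
from pyspark.serializers import PickleSerializer
from pyspark.sql import SparkSession

# Build the context explicitly so a serializer can be passed in,
# then wrap it in a SparkSession for the rest of the test code.
conf = SparkConf().setMaster("local[1]").setAppName("Python Spark unit test")
sc = SparkContext(conf=conf, serializer=PickleSerializer())
spark = SparkSession(sc)

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
print(rdd.stdev())

spark.stop()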

(2) dill provides a mechanism to help mitigate some of the conflicts in the pickle registry. If you use dill.extend(False) before using cloudpickle, then dill.extend(True) before using dill, it may clear up the issue you are seeing.
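An untested sketch of how (2) could fit into the test file from the question; do_dill_work is just a hypothetical placeholder for wherever dill is actually used:

import pytest
import dill
from pyspark.sql import SparkSession

# Remove dill's handlers from the shared pickle registry up front, so the
# cloudpickle bundled with pyspark sees the stock dispatch table again.
dill.extend(False)


@pytest.fixture(scope="session")
def start_session():
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("Python Spark unit test")
        .getOrCreate()
    )
    yield spark
    spark.stop()


def test_simple_rdd(start_session):
    rdd = start_session.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
    assert rdd.stdev() == 2.0


def do_dill_work(obj):
    # Re-enable dill's handlers only around the code that actually needs dill.
    dill.extend(True)
    try:
        return dill.dumps(obj)
    finally:
        dill.extend(False)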

Mike McKerns