
I'm trying to serialize a PySpark Pipeline object so that it can be saved and retrieved later. I tried the Python pickle library as well as PySpark's PickleSerializer, but the dumps() call itself fails.

Here is the code snippet using the native pickle library:

import pickle
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
with open('myfile', 'wb') as f:
    pickle.dump(pipeline, f, 2)
with open('myfile', 'rb') as f:
    pipeline1 = pickle.load(f)

I get the error below when running it:

py4j.protocol.Py4JError: An error occurred while calling o32.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:785)

Is it possible to serialize PySpark Pipeline objects?

Dinoop Thomas

1 Answer


Technically speaking, you can easily pickle a Pipeline object:

from pyspark.ml.pipeline import Pipeline
import pickle

pickle.dumps(Pipeline(stages=[]))
## b'\x80\x03cpyspark.ml.pipeline\nPipeline\nq ...

What you cannot pickle are Spark Transformers and Estimators, which are only thin wrappers around JVM objects. If you really need this, you can wrap the construction in a function, for example:

from pyspark.ml.feature import Tokenizer

def make_pipeline():
    return Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])

pickle.dumps(make_pipeline)
## b'\x80\x03c__ ...

but since it is just a piece of code and doesn't store any persistent data, it doesn't look particularly useful.
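
For completeness, here is a minimal sketch of the round trip this approach implies. The file name pipeline_factory.pkl is an assumption for illustration, and note that pickle stores a top-level function by reference, so make_pipeline has to be importable wherever you load it:

import pickle

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer

def make_pipeline():
    # Factory that rebuilds the Pipeline; only this plain Python function is pickled
    return Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])

# Pickle the factory function, not the Pipeline itself
with open('pipeline_factory.pkl', 'wb') as f:
    pickle.dump(make_pipeline, f)

# Later: load the factory and call it to rebuild a fresh, unfitted Pipeline
with open('pipeline_factory.pkl', 'rb') as f:
    factory = pickle.load(f)

pipeline = factory()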

zero323
  • This works when I try it with an empty Pipeline object, pickle.dumps(Pipeline(stages=[])); however, when I try to pickle a Pipeline object with stages, it still fails. I tried the function approach you proposed, but if I call pickle.dumps(make_pipeline()), it still fails with the same error. – Dinoop Thomas Apr 17 '16 at 12:44
  • And I will :) Take another look at my code `pickle.dumps(make_pipeline)` and yours `pickle.dumps(make_pipeline())`. I only pickle an object which can be used to generate a Pipeline, not the Pipeline itself. – zero323 Apr 17 '16 at 14:32