
I have an RDD of the following structure:

my_rdd = [Row(text='Hello World. This is bad.'), Row(text='This is good.'), ...]

I can perform parallel processing with python functions:

rdd2=my_rdd.map(lambda f: f.text.split()) 
for x in rdd2.collect():
  print(x)

and it gives me the expected output.

However, when I try to use the spark-NLP sentence breaker or sentiment analyzer, I get an error: PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects

in this line: for x in rdd2.collect():

Here is the code:

from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetectorDLModel
from sparknlp.pretrained import PretrainedPipeline
from pyspark.ml import PipelineModel

documenter = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel \
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

sd_pipeline = PipelineModel(stages=[documenter, sentencerDL])
sd_model = LightPipeline(sd_pipeline)
pipeline = PretrainedPipeline('analyze_sentiment', 'en')

If I try:

rdd2=my_rdd.map(lambda f: pipeline.annotate(f.text))                    

or

rdd2=my_rdd.map(lambda f: sd_model.fullAnnotate(f.text)[0]["sentences"].split()[0])

I get the same error. When I run either of them without map, they work as expected.

Would anyone know how to execute the spark-NLP sentence breaker or sentiment analyzer in parallel? What am I doing incorrectly?

Thanks all!

1 Answer


When you apply a Spark ML pipeline to a DataFrame whose data is distributed across different partitions, you get parallel execution by default. The same applies to a Spark NLP pipeline (which is a Spark ML pipeline too). So you can do:

pipeline.transform(dataframe)

and create 'dataframe' in such a way that its data is distributed across the nodes of the cluster. A good tutorial is here:

https://sparkbyexamples.com/pyspark/pyspark-create-dataframe-from-list/
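
As a minimal, self-contained sketch of that approach (sparknlp.start() is used here to get the SparkSession; the "sentiment" output column name is an assumption for the analyze_sentiment pipeline, so verify the actual columns with result.printSchema()):

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # SparkSession with the Spark NLP jars loaded

# The pretrained pipeline reads from a column named "text";
# repartition() spreads the rows over several partitions/executors.
data = [("Hello World. This is bad.",), ("This is good.",)]
df = spark.createDataFrame(data, ["text"]).repartition(4)

pipeline = PretrainedPipeline("analyze_sentiment", "en")
result = pipeline.transform(df)  # runs distributed, no manual map() needed

# Output column names depend on the pipeline; analyze_sentiment should
# produce a "sentiment" annotation column (check result.printSchema()).
result.select("text", "sentiment.result").show(truncate=False)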

Also, for mapping the contents of the DataFrame after the transformation with Spark NLP, you can use the functions under sparknlp.functions, for example map_annotations_col, which lets you map over a specific DataFrame column that contains Spark NLP annotations. Btw, this,

rdd2=my_rdd.map(lambda f: pipeline.annotate(f.text))

is something you shouldn't do. You're getting that exception because Spark is trying to serialize your entire pipeline and ship it to the worker nodes. That's not how it is meant to work: you pass the data to the pipeline and let the pipeline decide what gets distributed across the cluster.
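
If the data is small enough to bring to the driver, the LightPipeline from the question can also be used directly, without map(). A short sketch reusing my_rdd and sd_model from the question (LightPipeline.annotate accepts a list of strings and runs locally on the driver):

# Collect the texts to the driver and annotate them locally with the LightPipeline.
texts = [row.text for row in my_rdd.collect()]
results = sd_model.annotate(texts)  # list of dicts, e.g. {"document": [...], "sentences": [...]}

for r in results:
    print(r["sentences"])  # detected sentences as plain strings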

AlbertoAndreotti