I have an RDD of the following structure:
my_rdd = [Row(text='Hello World. This is bad.'), Row(text='This is good.'), ...]
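For context, I build the RDD roughly like this (the SparkSession variable spark and the sample rows are placeholders for my real setup):

from pyspark.sql import Row

rows = [Row(text='Hello World. This is bad.'), Row(text='This is good.')]
my_rdd = spark.sparkContext.parallelize(rows)  # spark is my SparkSession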
I can perform parallel processing with plain Python functions:
rdd2 = my_rdd.map(lambda f: f.text.split())
for x in rdd2.collect():
    print(x)
and it gives me the expected output.
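For example, the first row prints:

['Hello', 'World.', 'This', 'is', 'bad.']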
However, when I try to use the Spark NLP sentence detector or sentiment analyzer, I get an error on the line for x in rdd2.collect():

PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
Here is the code:
from pyspark.ml import PipelineModel
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetectorDLModel
from sparknlp.pretrained import PretrainedPipeline

documenter = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel \
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

sd_pipeline = PipelineModel(stages=[documenter, sentencerDL])
sd_model = LightPipeline(sd_pipeline)

pipeline = PretrainedPipeline('analyze_sentiment', 'en')
If I try:
rdd2 = my_rdd.map(lambda f: pipeline.annotate(f.text))

or

rdd2 = my_rdd.map(lambda f: sd_model.fullAnnotate(f.text)[0]["sentences"].split()[0])
the same error occurs. When I run them without map (i.e., directly on the driver), they work as expected.
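For reference, direct calls on the driver like these return results with no error (using one of the sample strings above):

pipeline.annotate('This is good.')
sd_model.fullAnnotate('This is good.')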
Does anyone know how to run the Spark NLP sentence detector or sentiment analyzer in parallel over an RDD? What am I doing wrong?
Thanks all!