
I am trying to use DocumentAssembler with an array of strings. The documentation says: "The DocumentAssembler can read either a String column or an Array[String]". But when I run a simple example:

data = spark.createDataFrame([[["Spark NLP is an open-source text processing library."]]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)

result.select("document").show(truncate=False)

I get the following error:

AnalysisException: [CANNOT_UP_CAST_DATATYPE] Cannot up cast input from "ARRAY<STRING>" to "STRING".
The type path of the target object is:
- root class: "java.lang.String"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object

Maybe I don't understand something?

Rory

1 Answer


I think you just added an extra [] around the input.

This is working:

data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)

result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[{document, 0, 51, Spark NLP is an open-source text processing library., {sentence -> 0}, []}]|
+----------------------------------------------------------------------------------------------+
Islam Elbanna
  • No, you are submitting a string as an input. I want to input an array of strings – Rory May 22 '23 at 13:27
  • So the API documentation accepts a string for each row of the Dataframe not an array https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/DocumentAssembler.scala#L40 – Islam Elbanna May 22 '23 at 13:33
  • look at row 28-29 in your link: "The `DocumentAssembler` can read either a `String` column or an `Array[String]`. Additionally, [[setCleanupMode]] can be used to pre-process the text" – Rory May 22 '23 at 13:43
  • 1
    You are right, it looks like an issue in the API itself, i have tried both Scala and python version and I get the same error. I suggest to raise an issue at https://github.com/JohnSnowLabs/spark-nlp/issues – Islam Elbanna May 22 '23 at 14:46
  • 1
    Yes, I wrote in the issue thread. Thanks anyway for your help – Rory May 22 '23 at 14:58