About an error regarding spark nlp using scala

Question

I am a beginner to spark-nlp and i am learning it by following examples in the johnsnowlabs. I am using SCALA in data bricks.

When i follow the example as follows,

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().
    setInputCol("text").
    setOutputCol("document")

val regexTokenizer = new Tokenizer().
    setInputCols(Array("sentence")).
    setOutputCol("token")
val sentenceDetector = new SentenceDetector().
    setInputCols(Array("document")).
    setOutputCol("sentence")

val finisher = new Finisher()
    .setInputCols("token")
    .setIncludeMetadata(true)


finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))

I am getting following error when i run the last line :

command-786892578143744:2: error: value withColumn is not a member of com.johnsnowlabs.nlp.Finisher
finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))

what may be the reason for this ?

When i try to do the example , by just omitting this line , i added follwoing additional lines of codes

val pipeline = new Pipeline().
    setStages(Array(
        documentAssembler,
        sentenceDetector,
        regexTokenizer,
        finisher
    ))

val data1 = Seq("hello, this is an example sentence").toDF("text")

pipeline.fit(data1).transform(data1).toDF("text")

I got another error when i run the last line :

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.

Can anyone help me to fix this issue ?

Thank you

score 2 · Accepted Answer · answered Mar 20 '20 at 19:39

Here what your code should look like, first construct the Pipeline:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().
    setInputCol("text").
    setOutputCol("document")

val regexTokenizer = new Tokenizer().
    setInputCols(Array("sentence")).
    setOutputCol("token")
val sentenceDetector = new SentenceDetector().
    setInputCols(Array("document")).
    setOutputCol("sentence")

val finisher = new Finisher()
    .setInputCols("token")
    .setIncludeMetadata(true)

val pipeline = new Pipeline().
    setStages(Array(
        documentAssembler,
        sentenceDetector,
        regexTokenizer,
        finisher
    ))

Create a simple DataFrame for testing:

val data1 = Seq("hello, this is an example sentence").toDF("text")

Now we fit and transform your DataFrame on this Pipeline:

val prediction = pipeline.fit(data1).transform(data1)

The variable prediction is a DataFrame which in that you can explode the token column. Let's have a look inside prediction DataFrame:

scala> prediction.show
+--------------------+--------------------+-----------------------+
|                text|      finished_token|finished_token_metadata|
+--------------------+--------------------+-----------------------+
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|
+--------------------+--------------------+-----------------------+

scala> prediction.withColumn("newCol", explode($"finished_token")).show
+--------------------+--------------------+-----------------------+--------+
|                text|      finished_token|finished_token_metadata|  newCol|
+--------------------+--------------------+-----------------------+--------+
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|   hello|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|       ,|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|    this|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|      is|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|      an|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...| example|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|sentence|
+--------------------+--------------------+-----------------------+--------+

Your first issue as Alberto mentioned, thinking that finisher was a DataFrame. It is an annotator until it is transformed.
The Second issue was having .toDF() in a place you didn't need it. (after pipeline transformation)
Your explode function being in a bad place aside, you are zipping a column that doesn't even exist in your pipeline: ner

Please feel free to ask any question and I'll update the answer accordingly.

score 1 · Answer 2 · answered Mar 13 '20 at 02:40

I think you have two problems, 1. First, you're trying to apply withColumn to an annotator, you should do it on the dataframe instead. 2. I think that's a problem coming from the toDF() after the transform. You need more columns, and you only provide 1. Also probably you don't need that toDF() at all.

Alberto.

About an error regarding spark nlp using scala

2 Answers2