spark-nlp annotators are designed to be used in their own specific pipelines, and the input columns of the different transformers have to include special metadata.
The exception already tells you that the input to the NorvigSweetingModel should be tokenized:
Make sure such columns have following annotator types: token
If I am not mistaken, at minimum you'll have to assemble documents and tokenize them here:
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline

import spark.implicits._  // required for toDF and the $-column syntax

val df = Seq(Seq("abc", "cde"), Seq("eefg", "efa", "efb")).toDF("names")

val nlpPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected")
))
A Pipeline like this can be applied to your data with a small adjustment - the input data has to be string, not array<string>*:
import org.apache.spark.sql.functions.concat_ws

val result = df
  .transform(_.withColumn("names", concat_ws(" ", $"names")))  // flatten array<string> into a single string
  .transform(df => nlpPipeline.fit(df).transform(df))

result.show()
+------------+--------------------+--------------------+--------------------+
| names| document| tokens| corrected|
+------------+--------------------+--------------------+--------------------+
| abc cde|[[document, 0, 6,...|[[token, 0, 2, ab...|[[token, 0, 2, ab...|
|eefg efa efb|[[document, 0, 11...|[[token, 0, 3, ee...|[[token, 0, 3, ee...|
+------------+--------------------+--------------------+--------------------+
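By the way, the "special metadata" mentioned at the top lives in the schema of these annotation columns. If you want to check which annotator type a column carries, you can inspect its column-level metadata directly (a quick sketch; it assumes spark-nlp records the annotator type in the Spark column metadata, which the validation message quoted above suggests):

// Print the metadata spark-nlp attached to the tokens column
// (assumption: the annotator type is stored in the column-level metadata)
println(result.schema("tokens").metadata)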
If you want an output that can be exported, you should extend your Pipeline with a Finisher:
import com.johnsnowlabs.nlp.Finisher

// Strips the annotation structs down to a plain array<string> column
new Finisher().setInputCols("corrected").transform(result).show
+------------+------------------+
| names|finished_corrected|
+------------+------------------+
| abc cde| [abc, cde]|
|eefg efa efb| [eefg, efa, efb]|
+------------+------------------+
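Alternatively, each annotation is a struct whose fourth field holds the resulting text (you can see it in the truncated output above), so you can pull the corrected strings out without a Finisher. A sketch, assuming the field is named result as in the annotation schema:

// Extract just the corrected strings from the annotation structs
// (assumption: the struct field holding the text is called `result`)
result.selectExpr("names", "corrected.result").show()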
* According to the docs, DocumentAssembler can read either a String column or an Array[String], but it doesn't look like that works in practice in 1.7.3:
df.transform(df => nlpPipeline.fit(df).transform(df)).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(names)' due to data type mismatch: argument 1 requires string type, however, '`names`' is of array<string> type.;;
'Project [names#62, UDF(names#62) AS document#343]
+- AnalysisBarrier
+- Project [value#60 AS names#62]
+- LocalRelation [value#60]
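Until that is fixed, flattening the array up front (concat_ws above) is the simplest workaround. If you'd rather keep each element separate, exploding the array into one string per row also gives the pipeline a plain string column - a sketch; note that it changes the row granularity:

import org.apache.spark.sql.functions.explode

// One row per array element; each row now holds a plain string
val exploded = df.withColumn("names", explode($"names"))
nlpPipeline.fit(exploded).transform(exploded).show()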