
I am using Spark NLP to annotate a long text file in Databricks. My code is like this:

    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    val lines = sc.textFile("/FileStore/tables/48320_0-3f0d3.txt")
    import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
    val result = PretrainedPipeline("explain_document_ml").annotate(lines)

But I got the error like this:

command-2722311848879511:1: error: overloaded method value annotate with alternatives:
  (target: Array[String])Array[Map[String,Seq[String]]] <and>
  (target: String)Map[String,Seq[String]]
 cannot be applied to (org.apache.spark.rdd.RDD[String])
val result = PretrainedPipeline("explain_document_ml").annotate(lines)

Since `annotate` can take a string or an array as a parameter, why can't I use the text file as the parameter? How should I modify my code? Thanks!

  • You need to convert your text file to a DataFrame (each line to a row, or each sentence to a row) and then use `.transform`, which runs in a parallel and distributed manner. `annotate` is for a short text/string, not for an entire file. – Maziyar Feb 14 '20 at 10:40
  • If you update the question with what the text file looks like, I can provide the correct code to transform your text file into an annotated DataFrame. – Maziyar Feb 14 '20 at 10:42
  • @Maziyar I used a "not good" trick to get the code to run: `val data = lines.flatMap(line => line.split("""\W+""")).collect().mkString(" ").toLowerCase(); val result = PretrainedPipeline("explain_document_dl").annotate(data)`. However, I don't think this is the correct way to do it in Spark. – Qiang Yao Feb 15 '20 at 20:35
  • @Maziyar The text file is just an e-book (.txt). The code is small. I think converting to a DataFrame should be the correct way to do it. – Qiang Yao Feb 15 '20 at 20:39
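
Following the suggestion in the comments, here is a minimal sketch of the DataFrame + `.transform` approach. It assumes a Databricks notebook where the `spark` session is already available; the file path is the one from the question, and the column names (`text`, `token`, `pos`) are Spark NLP's defaults:

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Read the e-book so that each line becomes one row of a DataFrame.
val data = spark.read.text("/FileStore/tables/48320_0-3f0d3.txt")
  .withColumnRenamed("value", "text") // the pretrained pipeline reads the "text" column

// `transform` annotates the whole DataFrame in a distributed manner,
// unlike `annotate`, which only accepts a single String or an Array[String].
val annotated = PretrainedPipeline("explain_document_ml").transform(data)

// Inspect, e.g., tokens and part-of-speech tags per line.
annotated.select("token.result", "pos.result").show(truncate = false)
```

This avoids collecting the whole file to the driver (as the `collect().mkString` workaround does) and keeps the annotation work on the executors.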
