
I am using Spark NLP to annotate a long text file in Databricks. My code is like this:

    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    val lines = sc.textFile("/FileStore/tables/48320_0-3f0d3.txt")
    import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
    val result = PretrainedPipeline("explain_document_ml").annotate(lines)

But I got the error like this:

command-2722311848879511:1: error: overloaded method value annotate with alternatives:
  (target: Array[String])Array[Map[String,Seq[String]]] <and>
  (target: String)Map[String,Seq[String]]
 cannot be applied to (org.apache.spark.rdd.RDD[String])
val result = PretrainedPipeline("explain_document_ml").annotate(lines)

Since `annotate` can take a string or an array as a parameter, why can't I use the text file as the parameter? How should I modify my code? Thanks!

  • You need to convert your text file to a DataFrame (each line to a row, or each sentence to a row) and then use `.transform`, which runs in a parallel and distributed manner. `annotate` is for a short text/string, not for an entire file. – Maziyar Feb 14 '20 at 10:40
  • If you update the question with what the text file looks like, I can provide the correct code to transform your text file into an annotated DataFrame. – Maziyar Feb 14 '20 at 10:42
  • @Maziyar I used a "not good" trick to get the code to run: `val data = lines.flatMap(line => line.split("""\W+""")).collect().mkString(" ").toLowerCase(); val result = PretrainedPipeline("explain_document_dl").annotate(data)`. However, I don't think this is the correct way to do it in Spark. – Qiang Yao Feb 15 '20 at 20:35
  • @Maziyar The text file is just an e-book (.txt). The code is small. I think converting to a DataFrame should be the correct way to do it. – Qiang Yao Feb 15 '20 at 20:39
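
Following the suggestion in the comments, here is a minimal sketch of the DataFrame + `.transform` approach. It assumes a Databricks notebook where the `spark` session is already available; the file path is the one from the question, and the column names (`text`, `token`, `pos`) are Spark NLP's defaults:

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Read the e-book so that each line becomes one row of a DataFrame.
val data = spark.read.text("/FileStore/tables/48320_0-3f0d3.txt")
  .withColumnRenamed("value", "text") // the pretrained pipeline reads the "text" column

// `transform` annotates the whole DataFrame in a distributed manner,
// unlike `annotate`, which only accepts a single String or an Array[String].
val annotated = PretrainedPipeline("explain_document_ml").transform(data)

// Inspect, e.g., tokens and part-of-speech tags per line.
annotated.select("token.result", "pos.result").show(truncate = false)
```

This avoids collecting the whole file to the driver (as the `collect().mkString` workaround does) and keeps the annotation work on the executors.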
