
I want to use lemmatization on a text file:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring 2008 issue moody audio backed.

omg left gotta wrap review order asap . understand hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

cables cables finally able hear gem long rumored music .
...

and the expected output is:

surprise heard thump open door small seed man clasp package wrap.

upgrade system found review spring 2008 issue mood audio back.

omg left gotta wrap review order asap . understand hand deliver dali lama

speak hand wear earplug live . listen maintain link long .

cable cable final able hear gem long rumor music .
...

Can anybody help me? And what is the simplest method for lemmatization that has been implemented in Scala and Spark?

Rozita
  • Well, "best" and "simplest" are not that simple to discuss :-) do you already have any idea in mind in order to improve? – Fabio Fantoni May 13 '15 at 19:43

3 Answers


There is a function from the book Advanced Analytics with Spark, in the chapter on lemmatization:

  val plainText = sc.parallelize(List("Sentence to be processed."))

  val stopWords = Set("stopWord")

  import java.util.Properties
  import scala.collection.mutable.ArrayBuffer
  import scala.collection.JavaConversions._
  import edu.stanford.nlp.pipeline._
  import edu.stanford.nlp.ling.CoreAnnotations._

  def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 2 && !stopWords.contains(lemma)) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

  val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
  lemmatized.foreach(println)

Now just use this for every line in a mapper:

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
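
If the input is a text file, as in the question, you could read it with sc.textFile instead of the parallelized sample and write the lemmas back out, one space-separated line per input line (just a sketch; the file paths here are hypothetical):

val plainText = sc.textFile("/path/to/input.txt")   // hypothetical input path
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
// join the lemmas of each line back into a single string before saving
lemmatized.map(_.mkString(" ")).saveAsTextFile("/path/to/output")   // hypothetical output path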

EDIT:

I added this line to the code:

import scala.collection.JavaConversions._

It is needed because otherwise sentences is a Java List, not a Scala collection. The code should now compile without problems.
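
As an aside (not part of the original answer): on newer Scala versions the deprecated JavaConversions can be swapped for JavaConverters with an explicit .asScala, for example:

import scala.collection.JavaConverters._

val sentences = doc.get(classOf[SentencesAnnotation]).asScala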

I used Scala 2.10.4 and the following stanford.nlp dependencies:

<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
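
If you build with sbt instead of Maven, the equivalent coordinates should look roughly like this (same artifacts and version as above; treat it as an untested sketch):

libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)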

You can also look at the stanford.nlp page; there are a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml.

EDIT:

mapPartitions version:

Although I don't know if it is going to speed up the job significantly.

  def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 2 && !stopWords.contains(lemma)) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

  val lemmatized = plainText.mapPartitions(p => {
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)
    p.map(q => plainTextToLemmas(q, stopWords, pipeline))
  })
  lemmatized.foreach(println)
abalcerek
  • Thank you @user52045 for your reply. Could you please tell me how I can use the above code in IntelliJ IDEA? – Rozita May 14 '15 at 06:59
  • What exactly are you struggling with? First you have to add the stanford.nlp dependencies to your project. The Maven and sbt dependencies can be found at http://mvnrepository.com/artifact/edu.stanford.nlp. – abalcerek May 14 '15 at 07:05
  • In case you have a problem with the project configuration, look at this link: https://docs.sigmoidanalytics.com/index.php/Step_by_Step_instructions_on_how_to_build_Spark_App_with_IntelliJ_IDEA – abalcerek May 14 '15 at 08:02
  • Have you ever run it? It shows an error when I want to run it in IntelliJ IDEA. Can you show me how it works with one example? – Rozita May 14 '15 at 14:27
  • Seems like a reasonable approach. I would prefer a mapPartitions method instead so that you only have to create the Stanford pipeline once per partition rather than once per RDD entry – David May 14 '15 at 17:03
  • @David I don't think it should be such a performance hit (but I guess it's worth testing), and this version seems much simpler (at least to me). – abalcerek May 14 '15 at 18:25
  • @David, could you explain to me how I can use mapPartitions? – Rozita May 15 '15 at 08:01
  • @Rozita You can use almost the same function, but instead of applying it to one element of the RDD you apply it in a loop to every element in the partition, and pull the initialization of the pipeline out before the loop. mapPartitions takes a function from the iterable of all elements in a partition to an iterable of results. – abalcerek May 15 '15 at 09:07
  • @user52045, I ran the above code but it is very time consuming, and when I changed the pipeline to mapPartitions it had some errors. Could you please guide me in running it with the mapPartitions method? – Rozita May 15 '15 at 16:11
  • Thanks @user52045. I ran into a problem using plainTextToLemmas: for the line val props = new Properties(), it shows that no package is imported. Could you help check which class that comes from? – HappyCoding Feb 04 '16 at 16:37
  • @HappyCoding `import java.util.Properties` – abalcerek Feb 04 '16 at 17:15
  • Thanks @user52045. I think you're right! After I imported the library and sbt-assembled my package, it shows an error: "Class java.util.function.Function not found - continuing with a stub.". Is it tied to a specific Java or Scala version? – HappyCoding Feb 10 '16 at 05:40
  • Hi @user52045, thanks. I solved the problem mentioned above. Just to share my experience: I found that updating the Java version from 1.7 to 1.8 solved the problem. – HappyCoding Feb 10 '16 at 09:16
  • This just shows me empty results (on the foreach) in the spark shell, even when trying with a larger input string. – Havnar May 10 '17 at 13:00
  • Getting error : Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as class path, filename or URL at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:485) at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:765) ... 67 more – Ajay Sant Jul 25 '17 at 14:37

I think @user52045 has the right idea. The only modification I would make would be to use mapPartitions instead of map -- this allows you to only do the potentially expensive pipeline creation once per partition. This may not be a huge hit on a lemmatization pipeline, but it will be extremely important if you want to do something that requires a model, like the NER portion of the pipeline.

def plainTextToLemmas(text: String, stopWords: Set[String], pipeline:StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(strings => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  strings.map(string => plainTextToLemmas(string, stopWords, pipeline))
})
lemmatized.foreach(println)
David
  • Tried it. mapPartitions performs much better. With the map function alone, I easily got a stack overflow problem. Thanks! – HappyCoding Feb 25 '16 at 14:26

I would suggest using the Stanford CoreNLP wrapper for Apache Spark, as it provides the official API for the basic CoreNLP functions such as lemmatization, tokenization, etc.

I have used it for lemmatization on a Spark DataFrame.

Link: https://github.com/databricks/spark-corenlp
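
As a rough sketch of how it can be used (the column names here are made up, and lemma is the column function listed in the spark-corenlp README, so check it against the version you pull in):

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._  // column functions exposed by the wrapper

import spark.implicits._  // assumes a SparkSession named `spark`, e.g. in spark-shell

val input = Seq(
  (1, "surprise heard thump opened door small seedy man clasping package wrapped.")
).toDF("id", "text")

// lemma(...) returns the lemmatized tokens of the text column
val output = input.select(col("id"), lemma(col("text")).as("lemmas"))
output.show(truncate = false)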

Ajay Sant