I am trying to learn Scala, and specifically text mining (lemmatization, TF-IDF matrix and LSA).

I have some texts that I want to lemmatize and then classify (LSA). I use Spark on Cloudera.

So I used this Stanford CoreNLP function:

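    import java.util.Properties
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer
    import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
      // Configure the annotators needed to get lemmas
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      val doc = new Annotation(text)
      pipeline.annotate(doc)
      val lemmas = new ArrayBuffer[String]()
      val sentences = doc.get(classOf[SentencesAnnotation])
      // CoreNLP returns Java lists, hence the .asScala conversions
      for (sentence <- sentences.asScala; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
        val lemma = token.get(classOf[LemmaAnnotation])
        // Keep lemmas longer than 2 characters that are not stop words
        if (lemma.length > 2 && !stopWords.contains(lemma)) {
          lemmas += lemma.toLowerCase
        }
      }
      lemmas
    }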

After that, I try to build a TF-IDF matrix, and here is my problem: the Stanford function produces an RDD[Seq[String]], but I get an error because I need an RDD[String] (not an RDD[Seq[String]]).

    val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatizedText, stopWords, numTerms, sc)

Does anyone know how to convert a Seq[String] to a String?

Or do I need to change one of my calls?

Thanks for the help. Sorry if it's a dumb question, and for my English.

Bye

So ode
  • Sorry, I need to clarify my question. The lemmatization function produces an RDD[Seq[String]], but I just need an RDD[String] for the TF-IDF. Do you know a lemmatization function that returns a String? – So ode Jul 17 '17 at 15:53

1 Answer

I am not sure what this lemmatization thingy is, but as far as making a string out of a sequence goes, you can just do seq.mkString("\n") (replacing "\n" with whatever other separator you want), or just seq.mkString if you want it merged without any separator.
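
As a minimal sketch of this, using a plain Scala collection as a stand-in for the RDD (the sample lemmas and the names `docs`, `joined` and `merged` are made up for illustration; on a real RDD[Seq[String]] the same `map(_.mkString(" "))` call should give you an RDD[String]):

```scala
object MkStringSketch {
  // Hypothetical sample data: one Seq of lemmas per document.
  val docs: Seq[Seq[String]] = Seq(
    Seq("cat", "sit", "mat"),
    Seq("dog", "bark")
  )

  // Join each document's lemmas into one space-separated String.
  val joined: Seq[String] = docs.map(_.mkString(" "))

  // With no argument, the elements are concatenated directly.
  val merged: String = docs.head.mkString

  def main(args: Array[String]): Unit = {
    joined.foreach(println)
    println(merged)
  }
}
```

A space is probably the separator you want here, since whatever tokenizes the text for TF-IDF needs to be able to split the terms apart again.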

Also, avoid mutable structures; it's bad taste in Scala:

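    // assumes: import scala.collection.JavaConverters._
    val lemmas = sentences.asScala
      .flatMap(_.get(classOf[TokensAnnotation]).asScala)  // sentences -> tokens
      .map(_.get(classOf[LemmaAnnotation]))
      .filter(_.length > 2)
      .filterNot(stopWords)
      .map(_.toLowerCase)
      .mkString(" ")  // space-separated, so the terms stay distinct
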
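
To try this kind of chaining without pulling in CoreNLP, here is a rough sketch on plain strings. The `rawLemmas` and `stopWords` values are invented for the example, and lowercasing before the stop-word check is my own choice here; the point is just that each step returns a new collection and that a Set[String] can be passed directly as the predicate to `filterNot`.

```scala
object ImmutableLemmaSketch {
  // Hypothetical inputs, standing in for the lemmas CoreNLP would return.
  val rawLemmas: Seq[String] = Seq("The", "cat", "be", "sit", "the", "Mat")
  val stopWords: Set[String] = Set("the")

  // No ArrayBuffer: each step returns a new collection.
  val lemmas: Seq[String] = rawLemmas
    .map(_.toLowerCase)      // normalise case before the stop-word check
    .filter(_.length > 2)    // drop very short tokens
    .filterNot(stopWords)    // a Set[String] is also a String => Boolean

  def main(args: Array[String]): Unit =
    println(lemmas.mkString(" "))
}
```
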
Dima