I am trying to learn Scala, and specifically text mining (lemmatization, TF-IDF matrix and LSA).
I have some texts that I want to lemmatize and then classify (LSA). I use Spark on Cloudera.
So I used this Stanford CoreNLP function:
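    import java.util.Properties
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer
    import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
      // Configure the pipeline up to the lemma annotator
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      val doc = new Annotation(text)
      pipeline.annotate(doc)
      val lemmas = new ArrayBuffer[String]()
      // CoreNLP returns Java lists, so convert them before iterating
      val sentences = doc.get(classOf[SentencesAnnotation]).asScala
      for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
        val lemma = token.get(classOf[LemmaAnnotation])
        // Keep lemmas longer than 2 characters that are not stop words
        if (lemma.length > 2 && !stopWords.contains(lemma)) {
          lemmas += lemma.toLowerCase
        }
      }
      lemmas
    }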
After that, I try to build a TF-IDF matrix, but here is my problem: the Stanford function gives me an RDD[Seq[String]], and I get an error. I need an RDD[String] (not an RDD[Seq[String]]).
    val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatized-text, stopWords, numTerms, sc)
Does anyone know how to convert a Seq[String] to a String?
Or do I need to change one of my calls?
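To make the conversion I am asking about concrete, here is a minimal sketch of one possible approach: joining each document's lemmas back into a single space-separated string with mkString. It uses plain Scala collections as a stand-in for the RDD so it runs without Spark; the object name LemmasToText, the helper lemmasToText and the sample lemmas are made up for illustration and are not part of my real code.

```scala
object LemmasToText {
  // Hypothetical helper: joins one document's lemmas into a single
  // space-separated string.
  def lemmasToText(lemmas: Seq[String]): String = lemmas.mkString(" ")

  def main(args: Array[String]): Unit = {
    // Stand-in for the lemmatized RDD[Seq[String]]; the data is invented.
    val lemmatized: Seq[Seq[String]] = Seq(
      Seq("spark", "run", "cluster"),
      Seq("lemma", "matrix")
    )

    // One String per document instead of one Seq[String] per document.
    val asText: Seq[String] = lemmatized.map(lemmasToText)
    asText.foreach(println)
  }
}
```

On an actual RDD[Seq[String]] the same idea would presumably be a map with mkString(" ") over the RDD, which yields an RDD[String]. I am not sure this is what termDocumentMatrix really wants as input, so the alternative would be to change that function to accept the Seq[String] form directly.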
Thanks for the help. Sorry if it's a dumb question, and for my English.
Bye