I am trying to build a classification system with Apache Spark's MLlib. I have shortlisted the Naive Bayes algorithm for this, and will be using Java 8 for its support of lambda expressions. I am a newbie when it comes to lambda expressions and so am finding it difficult to implement this in Java.
I am referring to the following link, which has the sample written in Scala, but I am having a hard time converting it to Java 8.
I am stuck on the following operation and can't get my head around it due to my unfamiliarity with Scala:
val idfs = (termDocsRdd.flatMap(termDoc => termDoc.terms.map((termDoc.doc, _))).distinct().groupBy(_._2) collect {
  // keep only terms that appear in more than 3 documents
  case (term, docs) if docs.size > 3 =>
    term -> (numDocs.toDouble / docs.size.toDouble)
}).collect.toMap
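To show what I understand the Scala to be doing (in case my reading is wrong), here is the same IDF computation sketched with plain Java 8 streams, without Spark. The `termsByDoc` map, the `minDocs` parameter, and the `IdfSketch` class are my own names for illustration; each document's terms are a `Set`, which stands in for the Scala `.distinct()` step:

```java
import java.util.*;
import java.util.stream.*;

public class IdfSketch {
    // For each term kept (appearing in more than minDocs documents),
    // IDF weight = numDocs / documentFrequency(term).
    static Map<String, Double> idfs(Map<String, Set<String>> termsByDoc, int minDocs) {
        long numDocs = termsByDoc.size();
        // Document frequency per term: count the documents containing it.
        // Sets guarantee each term is counted at most once per document.
        Map<String, Long> df = termsByDoc.values().stream()
                .flatMap(Set::stream)
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        // Mirror the Scala guard: keep only terms with df > minDocs.
        return df.entrySet().stream()
                .filter(e -> e.getValue() > minDocs)
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> (double) numDocs / e.getValue()));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = new HashMap<>();
        docs.put("d1", new HashSet<>(Arrays.asList("spark", "java")));
        docs.put("d2", new HashSet<>(Arrays.asList("spark", "scala")));
        docs.put("d3", new HashSet<>(Collections.singletonList("spark")));
        docs.put("d4", new HashSet<>(Arrays.asList("spark", "java")));
        // Only "spark" appears in more than 3 of the 4 documents.
        System.out.println(idfs(docs, 3));
    }
}
```

My assumption is that the equivalent Spark version would replace the streams with `JavaRDD`/`JavaPairRDD` operations, but the grouping-and-filtering logic above is what I am unsure about.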
Can someone please point me in the right direction on how to build TF-IDF vectors for textual document samples using Spark's RDD operations for distributed processing?