I am trying to build a classification system with Apache Spark's MLlib. I have shortlisted the Naive Bayes algorithm for this, and will be using Java 8 for its support of lambda expressions. I am a newbie when it comes to lambda expressions and so am finding it difficult to implement this in Java.
I am referring to the following link, which has the sample written in Scala, but I am having a hard time converting it to Java 8.
I am stuck on the following operation and can't get my head around it due to my unfamiliarity with Scala:
val idfs = (termDocsRdd.flatMap(termDoc => termDoc.terms.map((termDoc.doc, _))).distinct().groupBy(_._2) collect {
  // keep only terms that appear in more than 3 documents
  case (term, docs) if docs.size > 3 =>
    term -> (numDocs.toDouble / docs.size.toDouble)
}).collect.toMap
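To show what I understand the Scala to be doing (in case my reading is wrong), here is the same IDF computation sketched with plain Java 8 streams, without Spark. The `termsByDoc` map, the `minDocs` parameter, and the `IdfSketch` class are my own names for illustration; each document's terms are a `Set`, which stands in for the Scala `.distinct()` step:

```java
import java.util.*;
import java.util.stream.*;

public class IdfSketch {
    // For each term kept (appearing in more than minDocs documents),
    // IDF weight = numDocs / documentFrequency(term).
    static Map<String, Double> idfs(Map<String, Set<String>> termsByDoc, int minDocs) {
        long numDocs = termsByDoc.size();
        // Document frequency per term: count the documents containing it.
        // Sets guarantee each term is counted at most once per document.
        Map<String, Long> df = termsByDoc.values().stream()
                .flatMap(Set::stream)
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        // Mirror the Scala guard: keep only terms with df > minDocs.
        return df.entrySet().stream()
                .filter(e -> e.getValue() > minDocs)
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> (double) numDocs / e.getValue()));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = new HashMap<>();
        docs.put("d1", new HashSet<>(Arrays.asList("spark", "java")));
        docs.put("d2", new HashSet<>(Arrays.asList("spark", "scala")));
        docs.put("d3", new HashSet<>(Collections.singletonList("spark")));
        docs.put("d4", new HashSet<>(Arrays.asList("spark", "java")));
        // Only "spark" appears in more than 3 of the 4 documents.
        System.out.println(idfs(docs, 3));
    }
}
```

My assumption is that the equivalent Spark version would replace the streams with `JavaRDD`/`JavaPairRDD` operations, but the grouping-and-filtering logic above is what I am unsure about.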
Can someone please point me in the right direction on how to build TF-IDF vectors for textual document samples using Spark's RDD operations for distributed processing?