I'm trying to implement a simple SVM classification algorithm using Spark MLlib.
I have a set of Strings with their labels, and now I want to perform TF-IDF on them and feed the results to the SVM algorithm.
So what I am looking for is a transformation from String -> LabeledPoint with the TF-IDF step in the middle.
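To make the intended transformation concrete, here is a minimal plain-Java sketch (no Spark involved) of String -> term frequencies -> TF-IDF -> labeled point. Everything in it is an assumption for illustration: `TfIdfSketch`, `Labeled` and the method names are hypothetical, `Labeled` is only a stand-in for MLlib's LabeledPoint, and `String.hashCode` is used as the hash function, which is not necessarily what Spark's HashingTF does. The IDF uses the smoothed form log((m + 1) / (df + 1)) described in the Spark documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these names are Spark APIs.
final class TfIdfSketch {

    // Hypothetical stand-in for MLlib's LabeledPoint: a label plus dense features.
    static final class Labeled {
        final double label;
        final double[] features;

        Labeled(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Term frequencies via the hashing trick: each word is counted in bucket
    // floorMod(hashCode, numFeatures). Assumption: String.hashCode is the hash.
    static double[] termFrequencies(String sentence, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String word : sentence.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                tf[Math.floorMod(word.hashCode(), numFeatures)] += 1.0;
            }
        }
        return tf;
    }

    // Smoothed IDF per bucket: log((m + 1) / (df + 1)), with m documents in
    // total and df documents in which the bucket is non-zero.
    static double[] inverseDocumentFrequencies(List<double[]> tfs, int numFeatures) {
        double[] idf = new double[numFeatures];
        for (int i = 0; i < numFeatures; i++) {
            int df = 0;
            for (double[] tf : tfs) {
                if (tf[i] > 0.0) {
                    df++;
                }
            }
            idf[i] = Math.log((tfs.size() + 1.0) / (df + 1.0));
        }
        return idf;
    }

    // String -> TF -> TF-IDF -> labeled point, for a whole corpus.
    static List<Labeled> toLabeled(List<String> sentences, List<Double> labels, int numFeatures) {
        List<double[]> tfs = new ArrayList<>();
        for (String s : sentences) {
            tfs.add(termFrequencies(s, numFeatures));
        }
        double[] idf = inverseDocumentFrequencies(tfs, numFeatures);
        List<Labeled> out = new ArrayList<>();
        for (int d = 0; d < tfs.size(); d++) {
            double[] features = new double[numFeatures];
            for (int i = 0; i < numFeatures; i++) {
                features[i] = tfs.get(d)[i] * idf[i];
            }
            out.add(new Labeled(labels.get(d), features));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Labeled> points = toLabeled(
                Arrays.asList("a b", "a c"),
                Arrays.asList(1.0, 0.0),
                20);
        for (Labeled p : points) {
            System.out.println(p.label + " -> " + Arrays.toString(p.features));
        }
    }
}
```

What I am after is the Spark equivalent of the last step in `toLabeled`: pairing each row's label with its TF-IDF vector.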
I followed this example: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
It did not work for me, since transform() does not seem to work on RDDs but on DataFrames instead.
So I followed this tutorial: https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf
It worked pretty well. But now I'm stuck with a DataFrame and do not know how to transform it into a JavaRDD<LabeledPoint>.
I tried this Scala solution: From DataFrame to RDD[LabeledPoint]
But it does not work for me since I'm using Java.
I also tried this one: Spark MLLib TFIDF implementation for LogisticRegression
but, surprise, transform() does not work with JavaRDDs.
So this is the code I got from the tutorial. I am only looking for the function to put where the question marks are.
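// The label is a placeholder; it must be a double (0.0, not 0) to match DataTypes.DoubleType in the schema.
JavaRDD<Row> jrdd = documents.map(f -> RowFactory.create(0.0, f.getText()));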
StructType schema = new StructType(new StructField[]{
    new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
SQLContext sqlContext = new SQLContext(sc);
DataFrame sentenceData = sqlContext.createDataFrame(jrdd, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
DataFrame wordsData = tokenizer.transform(sentenceData);
int numFeatures = 20;
HashingTF hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setNumFeatures(numFeatures);
DataFrame featurizedData = hashingTF.transform(wordsData);
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(featurizedData);
DataFrame rescaledData = idfModel.transform(featurizedData);
JavaRDD<LabeledPoint> labeled = rescaledData.map(????????????????????????);
So what am I doing wrong? How can I do this? I'm going crazy here.
Thank you in advance.