I'm trying to implement a simple SVM classification algorithm using Spark MLlib.

I have a bunch of Strings and their labels, and now I want to perform TF-IDF on them and feed the results to the SVM algorithm.

So what I am looking for is a transformation from String -> LabeledPoint with the TF-IDF step in the middle.

I followed this example: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

and this https://github.com/apache/spark/blob/master/mllib/src/test/java/org/apache/spark/mllib/feature/JavaTfIdfSuite.java

It did not work, since that transform() does not operate on RDDs but on DataFrames instead.

So I followed this tutorial: https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf

It worked pretty well. But now I'm stuck with a Dataframe and do not know how to get it transformed to a JavaRDD.

I tried this Scala solution: From DataFrame to RDD[LabeledPoint]

But it does not work for me, since I'm using Java.

I also tried this one: Spark MLLib TFIDF implementation for LogisticRegression

but, surprise: transform() does not work with JavaRDDs either.

So this is the code I got from the tutorial. All I'm looking for is the function to put where the question marks are:

    // Label must be a double to match the DoubleType column in the schema below
    JavaRDD<Row> jrdd = documents.map(f -> RowFactory.create(0.0, f.getText()));

    StructType schema = new StructType(new StructField[]{
      new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
      new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    SQLContext sqlContext = new SQLContext(sc);
    DataFrame sentenceData = sqlContext.createDataFrame(jrdd, schema);
    Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
    DataFrame wordsData = tokenizer.transform(sentenceData);
    int numFeatures = 20;
    HashingTF hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(numFeatures);
    DataFrame featurizedData = hashingTF.transform(wordsData);
    IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
    IDFModel idfModel = idf.fit(featurizedData);
    DataFrame rescaledData = idfModel.transform(featurizedData);
    JavaRDD<LabeledPoint> labeled = rescaledData.map(????????????????????????);

So what am I doing wrong? How can I do this? I'm going crazy here.

Thank you in advance.


1 Answer

I solved this problem the following way. It was pretty easy; it just needed some thought.

    JavaRDD<Row> jrdd = preprocessedDocuments.map(f -> RowFactory.create(f.getLabel(), f.getText()));

    StructType schema = new StructType(new StructField[]{
      new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
      new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    SQLContext sqlContext = new SQLContext(sc);
    DataFrame sentenceData = sqlContext.createDataFrame(jrdd, schema);
    Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
    DataFrame wordsData = tokenizer.transform(sentenceData);
    int numFeatures = 20;
    HashingTF hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(numFeatures);
    DataFrame featurizedData = hashingTF.transform(wordsData);
    IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
    IDFModel idfModel = idf.fit(featurizedData);
    DataFrame rescaledData = idfModel.transform(featurizedData);
    JavaRDD<Row> rows = rescaledData.toJavaRDD();
    // Column 0 is "label", column 4 is "features" (after label, sentence, words, rawFeatures)
    JavaRDD<LabeledPoint> data = rows.map(f -> new LabeledPoint(f.getDouble(0), f.getAs(4)));
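For completeness, the resulting `JavaRDD<LabeledPoint>` can then be fed to MLlib's SVM, which was the original goal. A minimal sketch, assuming the `data` RDD from above; the split ratio, seed, and iteration count are arbitrary choices, taken in the spirit of the MLlib examples:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.regression.LabeledPoint;

// Split into training (60%) and test (40%) sets.
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.6, 0.4}, 11L);
JavaRDD<LabeledPoint> training = splits[0].cache();
JavaRDD<LabeledPoint> test = splits[1];

// Train a linear SVM; 100 iterations is just a starting point.
int numIterations = 100;
SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);

// Score the test set.
JavaRDD<Double> predictions = test.map(p -> model.predict(p.features()));
```

Note that `SVMWithSGD` expects the old `mllib` `LabeledPoint`, which is what the mapping above produces, so the two APIs line up.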
user2509422
  • 125
  • 10