I would like to use the Pipeline functionality of Spark 2.0+ to build my models, but I cannot figure out how to incorporate LSA/SVD into a Pipeline. I am aware of the RDD-based SVD in spark.mllib (RowMatrix.computeSVD), but I do not believe that can be used as a stage in a spark.ml Pipeline.
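For reference, this is the RDD-based approach I am referring to, as a rough sketch rather than working code; df, k, and the "wordFeatures" column (the TF-IDF output from the pipeline below) are placeholders:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.SingularValueDecomposition;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

// Convert the ml-style TF-IDF vectors into mllib vectors so RowMatrix can use them
JavaRDD<Vector> rows = df.select("wordFeatures").toJavaRDD()
    .map(row -> Vectors.fromML((org.apache.spark.ml.linalg.Vector) row.get(0)));

// Compute the top-k SVD; U holds the LSA representation of each document
RowMatrix mat = new RowMatrix(rows.rdd());
SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(k, true, 1e-9);
RowMatrix U = svd.U();   // document vectors in the reduced space
Vector s = svd.s();      // singular values
Matrix V = svd.V();      // term-space basis

That works on its own, but it lives outside the Transformer/Estimator API, so it cannot be cross-validated or persisted along with the rest of the stages.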
I would like to be able to do something like this:
Pipeline pipeline = new Pipeline();

// Break down the long description into word tokens
Tokenizer tokenizer = new Tokenizer()
    .setInputCol("long_description")
    .setOutputCol("words");

// Use hashing trick to get word counts
HashingTF hashingTF = new HashingTF()
    .setNumFeatures(numWords)
    .setInputCol(tokenizer.getOutputCol())
    .setOutputCol("hash");

// Take the inverse document frequency weights
IDF idf = new IDF()
    .setInputCol(hashingTF.getOutputCol())
    .setOutputCol("wordFeatures");

// *** THIS IS NOT POSSIBLE ***
SVD svd = new SVD()
    .setInputCol(idf.getOutputCol())
    .setOutputCol("svdFeatures");

// Combine all relevant feature columns into one feature vector
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{...})
    .setOutputCol("features");

// RandomForest Time!
RandomForestClassifier rf = new RandomForestClassifier()
    .setFeaturesCol(assembler.getOutputCol())
    .setLabelCol("labels")
    .setNumTrees(numTrees);

pipeline.setStages(new PipelineStage[]{tokenizer, hashingTF, idf, svd, assembler, rf});
I understand it's possible to do this with PCA (sketch below). Is there any way to pull it off with SVD/LSA?
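For comparison, this is roughly what the PCA version would look like as a drop-in stage (k is arbitrary here):

import org.apache.spark.ml.feature.PCA;

// PCA is already a spark.ml Estimator, so it slots straight into the Pipeline
PCA pca = new PCA()
    .setInputCol(idf.getOutputCol())
    .setOutputCol("pcaFeatures")
    .setK(50);  // number of components, chosen arbitrarily

pipeline.setStages(new PipelineStage[]{tokenizer, hashingTF, idf, pca, assembler, rf});

What I'm looking for is an equivalent stage that produces the SVD/LSA projection instead.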