I would like to use the Pipeline functionality of Spark 2.0+ to build my models, but I cannot figure out how to incorporate LSA/SVD into a Pipeline. I am aware of the RDD-based SVD in spark.mllib (RowMatrix.computeSVD), but I do not believe that can be used as a stage in a spark.ml Pipeline.
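For reference, this is the RDD-based approach I am referring to, as a rough sketch rather than working code; df, k, and the "wordFeatures" column (the TF-IDF output from the pipeline below) are placeholders:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.SingularValueDecomposition;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

// Convert the ml-style TF-IDF vectors into mllib vectors so RowMatrix can use them
JavaRDD<Vector> rows = df.select("wordFeatures").toJavaRDD()
    .map(row -> Vectors.fromML((org.apache.spark.ml.linalg.Vector) row.get(0)));

// Compute the top-k SVD; U holds the LSA representation of each document
RowMatrix mat = new RowMatrix(rows.rdd());
SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(k, true, 1e-9);
RowMatrix U = svd.U();   // document vectors in the reduced space
Vector s = svd.s();      // singular values
Matrix V = svd.V();      // term-space basis

That works on its own, but it lives outside the Transformer/Estimator API, so it cannot be cross-validated or persisted along with the rest of the stages.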
I would like to be able to do something like this:
Pipeline pipeline = new Pipeline();

// Break down the long description into word tokens
Tokenizer tokenizer = new Tokenizer()
    .setInputCol("long_description")
    .setOutputCol("words");

// Use hashing trick to get word counts
HashingTF hashingTF = new HashingTF()
    .setNumFeatures(numWords)
    .setInputCol(tokenizer.getOutputCol())
    .setOutputCol("hash");

// Take the inverse document frequency weights
IDF idf = new IDF()
    .setInputCol(hashingTF.getOutputCol())
    .setOutputCol("wordFeatures");

// *** THIS IS NOT POSSIBLE ***
SVD svd = new SVD()
    .setInputCol(idf.getOutputCol())
    .setOutputCol("svdFeatures");

// Combine all relevant feature columns into one feature vector
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{...})
    .setOutputCol("features");

// RandomForest Time!
RandomForestClassifier rf = new RandomForestClassifier()
    .setFeaturesCol(assembler.getOutputCol())
    .setLabelCol("labels")
    .setNumTrees(numTrees);

pipeline.setStages(new PipelineStage[]{tokenizer, hashingTF, idf, svd, assembler, rf});
I understand it's possible to do this with PCA (sketch below). Is there any way to pull it off with SVD/LSA?
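For comparison, this is roughly what the PCA version would look like as a drop-in stage (k is arbitrary here):

import org.apache.spark.ml.feature.PCA;

// PCA is already a spark.ml Estimator, so it slots straight into the Pipeline
PCA pca = new PCA()
    .setInputCol(idf.getOutputCol())
    .setOutputCol("pcaFeatures")
    .setK(50);  // number of components, chosen arbitrarily

pipeline.setStages(new PipelineStage[]{tokenizer, hashingTF, idf, pca, assembler, rf});

What I'm looking for is an equivalent stage that produces the SVD/LSA projection instead.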