Spark ML API to convert a vector to a probability for multilabel classification

Question

I'm a bit new to Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers(logistic or random forest etc). Once I train on Dataset[LabeledPoint], I'm finding it hard to get an API where I get the probability for each class for a single example. I've read on SO that you can use the pipeline API and get the probabilities, but for my use case this is going to be hard because I'll have to repicate 160 RDDs for my evaluation features, get probability for each class and then do a join to rank the classes by their probabilities. Instead, I want to just have one copy of evaluation features, broadcast the 160 models and then do the predictions inside the map function. I find myself having to implement this but wonder if there's another convenience API in Spark to do the same for different classifiers like Logistic/RF which converts a Vector representing features to the probability for it belonging to a class. Please let me know if there's a better way to approach multi-label classification in Spark.

EDIT: I tried to create a function to transform a vector to a label for random forest, but it's super annoying because I now have to clone large pieces of tree traversal in Spark, and almost everywhere I encountered dead ends because some function or variable was private or protected. Correct me if wrong, but if this use case is not already implemented, I think it atleast is well-justified because Scikit-learn already has such APIs in place to do this.

Thanks

user2103008 · Answer 1 · 2018-09-10T05:55:01.920

Found the culprit line in Spark MLLib code: https://github.com/apache/spark/blob/5ad644a4cefc20e4f198d614c59b8b0f75a228ba/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L224

The predict method is marked as protected but it should actually be public for such use cases to be supported.

This has been fixed in version 2.4 as seen here: https://github.com/apache/spark/blob/branch-2.4/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala

So upgrading to version 2.4 should do the trick ... although I don't think 2.4 is out yet, so it's a matter of waiting.

EDIT: for people that are interested, apparently not only is this beneficial for multi-label prediction, it's been observed that there's 3-4x improvement in latency as well for regular classification/regression for single instance/small batch predictions (see https://issues.apache.org/jira/browse/SPARK-16198 for details).

Spark ML API to convert a vector to a probability for multilabel classification

1 Answers1