I'm fairly new to the Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers (logistic regression, random forest, etc.). After training on a Dataset[LabeledPoint], I can't find an API that gives me the probability of each class for a single example.
I've read on SO that you can use the Pipeline API to get the probabilities, but for my use case that is going to be hard: I would have to replicate my evaluation features into 160 RDDs, get the probability for each class, and then join the results to rank the classes by probability. Instead, I want to keep a single copy of the evaluation features, broadcast the 160 models, and do the predictions inside a map function.
I find myself having to implement this by hand, but I wonder whether Spark already has a convenience API, shared by classifiers such as logistic regression and random forest, that converts a Vector of features into the probability of it belonging to a class. Please let me know if there's a better way to approach multi-label classification in Spark.
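To make the broadcast-and-map idea concrete, here is a minimal sketch in plain Scala, with no Spark dependency, of what I would like to do for the logistic case. Everything in it is an assumption for illustration: `BinaryLogisticModel`, `MultiLabelScoring` and `rankLabels` are hypothetical names, the weights are made up, and a local `Seq` stands in for both the evaluation RDD and the broadcast models. It also assumes each per-label model is a binary logistic regression, so the probability is the sigmoid of the weighted feature sum plus the intercept.

```scala
// Hypothetical stand-in for one trained per-label classifier (not a Spark class).
final case class BinaryLogisticModel(label: String, weights: Array[Double], intercept: Double) {
  // P(label applies | features) = sigmoid(weights . features + intercept)
  def probability(features: Array[Double]): Double = {
    require(features.length == weights.length, "feature/weight dimension mismatch")
    val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    1.0 / (1.0 + math.exp(-margin))
  }
}

object MultiLabelScoring {
  // Score one example against every model and rank labels by descending probability.
  def rankLabels(models: Seq[BinaryLogisticModel], features: Array[Double]): Seq[(String, Double)] =
    models.map(m => m.label -> m.probability(features)).sortBy { case (_, p) => -p }

  def main(args: Array[String]): Unit = {
    // Made-up models; in my case there would be 160 of these.
    val models = Seq(
      BinaryLogisticModel("sports", Array(2.0, -1.0), 0.0),
      BinaryLogisticModel("politics", Array(-1.0, 1.5), 0.1),
      BinaryLogisticModel("tech", Array(0.0, 0.0), 0.0)
    )
    // Single copy of the evaluation features (a local Seq standing in for the RDD).
    val evaluationSet = Seq(Array(1.0, 0.0), Array(0.0, 1.0))
    // One pass over the features, all models applied inside the map.
    evaluationSet.map(rankLabels(models, _)).foreach(r => println(r.mkString(", ")))
  }
}
```

In a real job the weights and intercepts would have to come from the trained models, and the `map` would run over the evaluation RDD while reading the models from a broadcast variable. This only covers logistic regression; a random forest has no such closed form, so it would still need a per-model prediction call.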
EDIT: I tried writing a function that transforms a feature vector into a label for random forest, but it is very tedious: I have to copy large pieces of Spark's tree-traversal code, and almost everywhere I hit a dead end because some function or variable is private or protected. Correct me if I'm wrong, but if this use case is not already implemented, I think it is at least well justified, since scikit-learn already has APIs for this.
Thanks