11

I am looking to implement a multi-label, multi-output classification algorithm with Spark, but I am surprised that there isn't any model in Spark's machine learning libraries that can do this.

How can I do this with Spark?

Otherwise, scikit-learn's LogisticRegression supports multi-label classification in input/output, but it doesn't scale to a huge amount of training data.

To view the code in scikit-learn, please see the following link: https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc

1 Answer

6

Spark also has Logistic Regression, which supports multilabel classification according to the API documentation. See also this.

The problem you have in scikit-learn with a huge amount of training data will disappear with Spark, given an appropriate Spark configuration.
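As a rough illustration of what "appropriate configuration" means, resources are usually set at submission time. The values and the script name `train_multilabel.py` below are placeholders, not recommendations; tune them to your cluster and data size:

```shell
# Illustrative only -- adjust memory, executors, and partitions
# to match your cluster and the size of the training set.
spark-submit \
  --master yarn \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  --conf spark.sql.shuffle.partitions=200 \
  train_multilabel.py
```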

Another approach is to train one binary classifier per label, and obtain multilabel output by running a relevant/irrelevant prediction for each label (often called binary relevance). You can easily do that in Spark using any binary classifier.
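A minimal sketch of the binary-relevance idea, written with scikit-learn on toy data for brevity; the same structure maps one-to-one onto any Spark binary classifier (one fitted model per label column):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 4 samples, 2 features, 2 independent binary labels.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])

# Binary relevance: one relevant/irrelevant classifier per label column.
classifiers = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Stack the per-label predictions back into a multilabel indicator matrix.
pred = np.column_stack([clf.predict(X) for clf in classifiers])
```

Each column of `pred` is one label's relevant/irrelevant decision; in Spark you would hold a list of fitted models and join their prediction columns the same way.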

What might also help, indirectly, is multilabel categorization with nearest neighbors, which is also state of the art. There are nearest-neighbor extensions for Spark, such as Spark KNN or Spark KNN graphs, for instance.
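To make the nearest-neighbors route concrete, here is a toy sketch with scikit-learn's KNeighborsClassifier, which accepts a 2-D multilabel indicator target directly (the Spark KNN extensions follow the same idea: vote per label among the neighbors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy multilabel data: each row of Y is a binary relevance vector.
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
Y = np.array([[1, 0], [1, 0], [1, 1], [1, 1], [0, 1], [0, 1]])

# KNeighborsClassifier handles a 2-D multilabel target out of the box:
# each label is decided by a majority vote among the k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, Y)
pred = knn.predict(np.array([[0.05], [2.05]]))
```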

marilena.oita
  • Thank you so much. Without your answer, nobody would know Spark has multilabel logistic regression support. There's no mention of it in the Spark docs afaict, and the API page doesn't even show up when I google for Spark multilabel. I don't know why this wasn't ever marked as the correct answer. – user2103008 Oct 11 '19 at 03:53
  • Very useful answer: actually there is still a big issue regarding the number of labels (or rather, the number of label subsets present in the training set), since the StringIndexer just creates more labels and then trains the model as if it were single-label. Example: with classes $A,B,C$, what happens is just that the StringIndexer creates new classes $AB,BC,AC,ABC$, so the model just becomes single-label again... – Tommaso Guerrini Oct 16 '19 at 08:36
  • 4
    Update: I actually went through the Spark code and it seems LogisticRegressionWithSGD supports only multi-class, not multi-label. This is based on the code here https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala#L251 where I see a softmax being computed instead of the per-label sigmoid of multilabel logistic regression. Maybe the docs meant multi-class rather than multi-label. – user2103008 Nov 03 '19 at 16:26