I want to do multi-label text classification on a big data set, and it seems that big data machine learning tools such as Apache Mahout or Spark MLlib do not currently support it. Has anyone done multi-label classification on big data sets before? Are there any plans to integrate multi-label classification into either Mahout or Spark in the near future?
2 Answers
This paper addresses the nature of the benefits you would receive from multioutput forecasting... namely:
- The ability to account for multiple independent input parameters when making a prediction, rather than having to continuously update your metrics for each nth index prediction you are trying to make within a given forecast.
- Computational speed is increased.
Based on your need, I would recommend down-sampling to a smaller group for your current problem and then, if performance does not match what you are looking for, creating multiple models around bespoke groups within your dataset.
I am still encountering this challenge myself (4 years since your post...).
Here is a list of helpful articles that I have collected while trying to address this:

Can we first transform the labels into a single class and then, after prediction, transform the class back into the original labels? For example, I have 3 labels to predict, [y1, y2, y3]. If [y1, y2, y3] = [1, 0, 1], then I give it the class label 101 (binary) = 5. During prediction, I compute the probability of y1 as follows: p(y1=1) = p(100) + p(101) + p(110) + p(111). In this way a multi-label problem becomes a multi-class problem.
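The encoding described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `encode`, `decode`, and `marginals` are hypothetical, and `class_probs` is made-up output standing in for whatever multi-class model is actually used.

```python
def encode(labels):
    """Map a binary label vector such as [1, 0, 1] to a class id (binary 101 = 5)."""
    class_id = 0
    for bit in labels:
        class_id = (class_id << 1) | int(bit)
    return class_id


def decode(class_id, n_labels):
    """Invert encode: class id 5 with 3 labels gives [1, 0, 1]."""
    return [(class_id >> (n_labels - 1 - i)) & 1 for i in range(n_labels)]


def marginals(class_probs, n_labels):
    """For each label, sum the probabilities of every class in which it is set."""
    result = [0.0] * n_labels
    for class_id, p in enumerate(class_probs):
        for i, bit in enumerate(decode(class_id, n_labels)):
            if bit:
                result[i] += p
    return result


# Hypothetical probabilities from a multi-class model over the 2**3 = 8 classes,
# indexed by class id (000, 001, ..., 111). These numbers are invented.
class_probs = [0.05, 0.05, 0.10, 0.10, 0.10, 0.30, 0.10, 0.20]

# label_probs[0] is p(y1=1) = p(100) + p(101) + p(110) + p(111)
label_probs = marginals(class_probs, 3)
```

Note that the number of classes grows as 2^n with the number of labels n, so this is likely only practical for a small number of labels, or when the classes are restricted to label combinations that actually occur in the training data.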
