How to do multi-label classification in Apache Spark

Question

I want to do multi-label text classification on a big data set, and it seems that big-data machine learning tools such as Apache Mahout or Spark MLlib do not currently support it. Has anyone done multi-label classification on big data sets before? Is there any plan to integrate multi-label classification into either Mahout or Spark in the near future?

score 0 · Answer 1 · answered Aug 21 '19 at 20:20

This paper addresses the benefits you would get from multi-output forecasting, namely:

The ability to account for multiple independent input parameters when making a prediction, rather than having to continuously update your metrics for each nth-index prediction you are trying to make within a given forecast.
Computational speed is increased.

Based on your needs, I would recommend down-sampling to a smaller group for your current problem and then, if performance does not match what you are looking for, building multiple models around bespoke groups within your dataset.

I am still encountering this challenge myself (4 years since your post...).

Here is a list of helpful articles that I have collected while trying to address this:

score 0 · Answer 2 · edited Jun 24 '21 at 19:14

Can we first transform the label combination into a single class, and then, after prediction, transform it back to the original labels? For example, I have 3 labels to predict, [y1, y2, y3]. If [y1, y2, y3] = [1, 0, 1], then I give it the class label 101 (binary) = 5. During prediction, I compute the probability of y1 in the following way: p(y1=1) = p(100) + p(101) + p(110) + p(111). In this way a multi-label problem becomes a multi-class problem.
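This transformation is usually called the label powerset approach. Here is a minimal sketch of the encoding, the decoding, and the probability sum in plain Python. The function names (encode, decode, marginal_probabilities) are made up for illustration and are not part of Spark or Mahout. The class probabilities are assumed to come from whatever multi-class classifier you train on the encoded class ID.

```python
from typing import List, Sequence


def encode(labels: Sequence[int]) -> int:
    """Pack binary labels [y1, ..., yk] into one class ID, with y1 as the most significant bit."""
    class_id = 0
    for bit in labels:
        class_id = (class_id << 1) | bit
    return class_id


def decode(class_id: int, num_labels: int) -> List[int]:
    """Unpack a class ID back into the binary label vector [y1, ..., yk]."""
    return [(class_id >> (num_labels - 1 - i)) & 1 for i in range(num_labels)]


def marginal_probabilities(class_probs: Sequence[float], num_labels: int) -> List[float]:
    """Sum class probabilities into per-label probabilities.

    class_probs[c] is the predicted probability of class ID c, so
    p(y_i = 1) is the sum over every class whose i-th bit is 1.
    """
    marginals = [0.0] * num_labels
    for class_id, p in enumerate(class_probs):
        for i, bit in enumerate(decode(class_id, num_labels)):
            if bit:
                marginals[i] += p
    return marginals


# Example from the answer: [y1, y2, y3] = [1, 0, 1] becomes class 5.
class_id = encode([1, 0, 1])
labels = decode(class_id, 3)

# Hypothetical classifier output over the 2**3 = 8 classes (000 ... 111).
probs = [0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.125, 0.125]
per_label = marginal_probabilities(probs, 3)
```

Keep in mind that k labels give up to 2^k classes, so this only stays practical when the number of labels is small or when few label combinations actually occur in the data.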
