
I need to perform a multiclass multilabel classification with CatBoost.

Example data:

X = [[1, 2, 3, 4], [2, 3, 5, 1], [4, 5, 1, 3]]

y = [[3, 1], [2, 8], [7, 8]]

Could you provide a working example?

I suppose I'd need to wrap the CatBoostClassifier with some sklearn classifier.

Thanks!

1 Answer


You are right that this can be done with a sklearn wrapper, specifically sklearn's implementation of the one-vs-rest classifier. This technique builds a classifier for each class, treating your problem as a combination of binary classification problems, one per class.

How does this work? For a given class, the samples labeled with that class constitute the positive samples, and all other samples are treated as negative samples.
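
To make this concrete, here is a minimal sketch (using only the example labels above) of how MultiLabelBinarizer turns the label lists into a k-hot matrix, where each column is the binary target for one of those per-class problems:

from sklearn.preprocessing import MultiLabelBinarizer

# Illustration with the question's labels only
y = [[3, 1], [2, 8], [7, 8]]
mlb = MultiLabelBinarizer()
y_k_hot = mlb.fit_transform(y)

print(mlb.classes_)  # [1 2 3 7 8]
print(y_k_hot)
# [[1 0 1 0 0]   <- column 0 is the binary target for class 1, column 1 for class 2, ...
#  [0 1 0 0 1]
#  [0 0 0 1 1]]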

This is a viable approach when your number of classes is small. However, when you have a large number of classes, the memory usage and training time become prohibitive. In that case, it could be far more efficient to use a neural-network-based approach, provided that you have a good amount of data.

Here's a working example:

from catboost import CatBoostClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Using your example data
X = [[1, 2, 3, 4], [2, 3, 5, 1], [4, 5, 1, 3]]
y = [[3, 1], [2, 8], [7, 8]]

# Turn the label lists into a k-hot indicator matrix (one column per class)
mlb = MultiLabelBinarizer()
mlb.fit(y)
y_k_hot = mlb.transform(y)

# One-vs-rest: fit one binary CatBoost classifier per class
ovr = OneVsRestClassifier(estimator=CatBoostClassifier(iterations=10, random_state=1))
ovr.fit(X, y_k_hot)

# Multiply the 0/1 predictions by the class labels to make the output readable
ovr.predict(X) * mlb.classes_

array([[1, 0, 3, 0, 0],
       [0, 2, 0, 0, 8],
       [0, 0, 0, 7, 8]])
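
If you would rather get the predictions back as label lists instead of that weighted matrix, one option (a small sketch reusing the mlb and ovr objects fitted above) is to invert the binarizer:

# Map the predicted indicator matrix back to tuples of labels
mlb.inverse_transform(ovr.predict(X))
# on this toy data it should return something like [(1, 3), (2, 8), (7, 8)]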

  • What's a big number of classes for this context? – Walter Nov 25 '20 at 04:20
  • @Walter This depends on how large your individual estimators are; in this case we are using CatBoost as our base estimator. A CatBoost model is a lot larger than, say, a simple linear regression model. So the point I'm making is that if you use CatBoost as your base estimator, you will probably find that your OvR model gets very large (a few GB) once you have roughly more than 100 classes. But you can always reduce the number of trees in each CatBoost model to alleviate this, at the expense of model performance. – Lars Vagnes Dec 02 '20 at 09:26
  • Thanks Lars. I discarded ensemble methods (CatBoost, LightGBM, XGBoost, etc.) as an option for my classification problem because I have dozens to hundreds of classes across 3 different labels, all of them categorical, so I would need to one-hot encode them and in the end wouldn't get a good classification result. Because the classes, labels, and amount of data may differ, I tried AutoML, which handles multi-label and multi-class, but I had trouble controlling the final layer (I wanted to use cross entropy with logits) and got poor accuracy. – Dec 02 '20 at 12:44
  • I know the question is not about the algorithms I'm describing, but it may help someone who comes here. I'm now using Transformers with BERT and RoBERTa architectures through Hugging Face, Simple Transformers, PyTorch, and TensorFlow. To test AutoML I used AutoKeras, and I still need to try H2O and auto-sklearn, which take a classical machine learning approach rather than a deep learning one. AutoML is also available on GCP, so it's easy to scale for image, video, tabular, etc. data. – Dec 02 '20 at 12:51