Not able to use Stratified-K-Fold on multi label classifier

Question

The following code performs K-fold validation, but I am unable to train the model because it throws this error:

ValueError: Error when checking target: expected dense_14 to have shape (7,) but got array with shape (1,)

My target Variable has 7 classes. I am using LabelEncoder to encode the classes into numbers.

After seeing this error, I switched to MultiLabelBinarizer to encode the classes, and now I get the following error:

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.

The following is the code for KFold validation

skf = StratifiedKFold(n_splits=10, shuffle=True)
scores = np.zeros(10)
idx = 0
for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
    print("Training on fold " + str(index+1) + "/10...")
    # Generate batches from indices
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
    model = None
    model = load_model()  # defined above

    scores[idx] = train_model(model, xtrain, ytrain, xval, yval)
    idx+=1
print(scores)
print(scores.mean())

I don't know what to do. I want to use Stratified K Fold on my model. Please help me.

panktijk · Accepted Answer · 2019-02-28T00:07:29.353

16

MultiLabelBinarizer returns, for each sample, a binary vector whose length equals your number of classes.

If you look at how StratifiedKFold splits your dataset, you will see that it only accepts a one-dimensional target variable, whereas you are passing a target variable with shape [n_samples, n_classes].
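A quick way to see this is to ask scikit-learn how it classifies each target. The sketch below uses `type_of_target` from `sklearn.utils.multiclass`; the toy arrays are invented for illustration and are not the asker's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.multiclass import type_of_target

# Toy data, invented for illustration: 6 samples, 3 possible labels.
X = np.zeros((6, 2))
y_multiclass = np.array([0, 1, 2, 0, 1, 2])            # shape (n_samples,)
y_multilabel = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                         [1, 1, 0], [1, 0, 1], [0, 1, 1]])  # shape (n_samples, n_classes)

kind_1d = type_of_target(y_multiclass)
kind_2d = type_of_target(y_multilabel)
print(kind_1d)  # multiclass
print(kind_2d)  # multilabel-indicator

# StratifiedKFold rejects the 2-D indicator matrix with a ValueError.
try:
    list(StratifiedKFold(n_splits=2).split(X, y_multilabel))
    split_error = None
except ValueError as exc:
    split_error = str(exc)
print(split_error)
```

The 1-D integer target is accepted, while the indicator matrix triggers the same kind of "Supported target types" error quoted in the question.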

A stratified split basically preserves your class distribution in every fold. And if you think about it, that does not make a lot of sense when you have a multi-label classification problem.
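To make "preserves your class distribution" concrete, here is a minimal sketch on a made-up single-label target with an 80/20 class imbalance. The data and the choice of 5 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented imbalanced target: 80 samples of class 0, 20 samples of class 1.
X = np.zeros((100, 1))
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many samples of each class land in every validation fold.
fold_counts = [
    np.bincount(y[val_idx], minlength=2).tolist()
    for _, val_idx in skf.split(X, y)
]
print(fold_counts)
```

Every validation fold keeps the same 80/20 ratio as the full dataset, which is only well defined when each sample has exactly one class.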

If you want to preserve the distribution in terms of the different combinations of classes in your target variable, then the answer here explains two ways in which you can define your own stratified split function.

UPDATE:

The logic is something like this:

Assume you have n classes and each target is a combination of these n classes. There are (2^n) - 1 possible combinations (not counting all 0s). You can now create a new target variable that treats each combination as a new label.

For example, if n=3, you will have 7 unique combinations:

 1. [1, 0, 0]
 2. [0, 1, 0]
 3. [0, 0, 1]
 4. [1, 1, 0]
 5. [1, 0, 1]
 6. [0, 1, 1]
 7. [1, 1, 1]

Map all your labels to this new target variable. You can now look at your problem as simple multi-class classification, instead of multi-label classification.
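The mapping for n=3 can be sketched with the standard library alone. The names `combos` and `combo_to_class`, and the four sample rows, are hypothetical; the numbering of the combinations is arbitrary and differs from the list above.

```python
from itertools import product

n = 3

# Every 0/1 vector of length n except the all-zero one: 2**n - 1 combinations.
combos = [c for c in product([0, 1], repeat=n) if any(c)]
print(len(combos))  # 7

# Give each combination its own integer class id.
combo_to_class = {c: i for i, c in enumerate(combos)}

# Hypothetical multi-label targets, one tuple per sample.
y = [(1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 1, 1)]
y_new = [combo_to_class[row] for row in y]
print(y_new)  # [3, 5, 3, 2]
```

Identical rows receive the same id, so `y_new` is an ordinary multi-class target.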

Now you can directly use StratifiedKFold with y_new as your target. Once the splits are done, you can map your labels back.

Code sample:

import numpy as np

np.random.seed(1)
y = np.random.randint(0, 2, (10, 7))
y = y[np.where(y.sum(axis=1) != 0)[0]]

OUTPUT:

array([[1, 1, 0, 0, 1, 1, 1],
       [1, 1, 0, 0, 1, 0, 1],
       [1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 1, 0, 1, 1],
       [0, 0, 1, 0, 0, 1, 1],
       [1, 0, 1, 0, 0, 1, 1],
       [0, 1, 1, 1, 1, 0, 0]])

Label encode your class vectors:

from sklearn.preprocessing import LabelEncoder

def get_new_labels(y):
    y_new = LabelEncoder().fit_transform([''.join(str(l)) for l in y])
    return y_new

y_new = get_new_labels(y)

OUTPUT:

array([7, 6, 3, 3, 2, 5, 8, 0, 4, 1])
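Putting the pieces together, the sketch below stratifies on the combined labels and then indexes back into the original multi-label matrix. It swaps the string-based encoding for `np.unique(..., axis=0, return_inverse=True)`, which is an alternative and not the method used above; the toy data (four label patterns, six samples each) is an assumption chosen so that every combination has enough members for 3 folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (invented): 4 label combinations, each repeated 6 times.
patterns = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]])
y = np.repeat(patterns, 6, axis=0)        # shape (24, 3)
X = np.arange(len(y)).reshape(-1, 1)      # placeholder features

# One integer id per distinct row. ravel() guards against NumPy versions
# that return the inverse with an extra dimension.
_, inverse = np.unique(y, axis=0, return_inverse=True)
y_new = np.asarray(inverse).ravel()

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
val_counts = []
for train_idx, val_idx in skf.split(X, y_new):
    # Stratify on y_new, but train on the original multi-label rows.
    x_train, x_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    val_counts.append(np.bincount(y_new[val_idx], minlength=4).tolist())

print(val_counts)
```

"Mapping back" here is just indexing `y` with the fold indices, so the model still sees the 2-D multi-label targets. Combinations that occur fewer times than `n_splits` cannot be spread evenly over the folds, so rare combinations may need to be merged or handled separately.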

edited Feb 28 '19 at 00:07

answered Feb 26 '19 at 23:24

panktijk


I am not able to understand the solution given in the link. It was so complicated. Can you explain to me how to use that function in my problem? – Sai Pavan Feb 27 '19 at 07:08
@SaiPavan Updated the answer. – panktijk Feb 27 '19 at 17:29
Thank you very much – Sai Pavan Feb 28 '19 at 07:58
The code you shared is working. But I don't seem to understand the logic behind it. – Sai Pavan Feb 28 '19 at 09:29
@SaiPavan which part exactly? I'd suggest you read about how stratified splits work and why `sklearn` does not support them for multi-label problems. The main idea here is to transform your problem into a multi-class one so you can apply stratified splits. – panktijk Mar 01 '19 at 18:43
I don't understand why directly using the label encoder doesn't work but using fit_transform in between works – Sai Pavan Mar 01 '19 at 18:49
@SaiPavan Sorry I don't quite understand what you mean by "using label encoder doesn't work". You **always** have to use `fit_transform` with `LabelEncoder` whenever you want to transform your labels. – panktijk Mar 01 '19 at 19:05
That's where I got it wrong. If you see my question, I have applied `LabelEncoder` to my class variable but I was still getting the error 1 because I didn't apply `fit_transform` then. – Sai Pavan Mar 01 '19 at 19:12
@SaiPavan You have not included your code for label encoding. So I can't say what went wrong there. – panktijk Mar 01 '19 at 19:51
@panktijk thanks for your answer. It seems to me that the `''.join()` you use is not necessary, right? Did you have any reason to use that instead of just calling `str(l)`? – tjiagoM Dec 30 '19 at 17:00
I believe the `''.join(str(l))` part is only done to create a unique representation of each combination. You could use any function that returns a string I think as long as it makes a unique combination for the labelencoder to encode – Gerard May 04 '23 at 09:03

score 0 · Answer 2 · answered May 04 '23 at 09:05

Just to expand on the great work of @panktijk, here is a full example. Perhaps this could be merged into his answer?

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder


np.random.seed(1)
N = 1000
X = np.random.random((N, 100))
y = np.random.randint(0, 2, (N, 7))


def get_new_labels(y):
    """ Convert each multilabel vector to a unique string """
    yy = [''.join(str(l)) for l in y]
    y_new = LabelEncoder().fit_transform(yy)
    return y_new

y_new = get_new_labels(y)
folder = StratifiedKFold(n_splits=2)

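for train_indices, test_indices in folder.split(X, y_new):
    # Do stuff with train and test indices, e.g. slice the original arrays
    X_train, X_test = X[train_indices], X[test_indices]
    y_train, y_test = y[train_indices], y[test_indices]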

score 0 · Answer 3 · answered May 30 '23 at 09:53


The GitHub project 'iterative-stratification' provides MultilabelStratifiedKFold() [...], so there is also a scikit-learn compatible implementation:

https://github.com/trent-b/iterative-stratification


OliverHennhoefer

