Correct way of one-hot-encoding class labels for multi-class problem

Question

I have a classification problem with multiple classes, let's call them A, B, C and D. My data has the following shape:

X=[#samples, #features, 1], y=[#samples,1].

To be more specific, the y looks like this:

[['A'], ['B'], ['D'], ['A'], ['C'], ...]

When I train a Random Forest classifier on these labels, it works fine. However, I have read multiple times that class labels also need to be one-hot encoded. After one-hot encoding, y is

[[1,0,0,0], [0,1,0,0], ...]

and has the shape

[#samples, 4]

The problem arises when I try to use this as classifier input. The model predicts every one of the four labels individually, meaning that it is also able to produce an output like [0 0 0 0], which I don't want. rfc.classes_ returns

# [array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]

How would I tell the model that the labels are one hot encoded instead of multiple labels which shall be predicted independently of each other? Do I need to change my y or do I need to alter some settings of the model?

score 4 · Answer 1 · answered Apr 14 '20 at 07:44

Your original approach, without one-hot encoding, was doing what you wanted.

One-hot encoding is commonly needed for categorical inputs to many models, but for outputs only a few require it (e.g. training a neural network with cross-entropy loss). So one-hot encoded targets are needed only for some algorithm implementations, while others do fine without them.

For output labels, a classifier like RandomForest is just fine with strings and multiple classes.
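If the labels have already been one-hot encoded, one way to get back to the 1-D format is to take the position of the 1 in each row. This is only a minimal sketch: the column order in `class_names` and the toy data are assumptions for illustration, not taken from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical column order of the one-hot matrix.
class_names = np.array(["A", "B", "C", "D"])

# Toy one-hot labels for illustration: A, B, D, A, C.
y_onehot = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
])

# argmax over each row recovers the class index; indexing maps it to a name.
y_labels = class_names[y_onehot.argmax(axis=1)]
print(y_labels)  # ['A' 'B' 'D' 'A' 'C']

# Toy features; with a 1-D y the forest treats this as one multiclass target.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y_labels), 3))
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_labels)
print(clf.classes_)  # ['A' 'B' 'C' 'D']
```

With a 1-D y, `classes_` is a single array of four classes rather than four separate `[0, 1]` arrays, so every prediction is exactly one of the four labels.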

Jim Chen · Accepted Answer · 2020-04-14T07:50:28.437

You don't need one-hot encoding when using a random forest in sklearn.

If anything, what you need is a "label encoder"; your y would then look like this:

from sklearn.preprocessing import LabelEncoder
y = ["A","B","D","A","C"]
le = LabelEncoder()
le.fit_transform(y)
# array([0, 1, 3, 0, 2], dtype=int64)
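The integer codes can be mapped back to the original labels again. A small sketch reusing the same example labels (the codes passed to `inverse_transform` are made up, standing in for model predictions):

```python
from sklearn.preprocessing import LabelEncoder

y = ["A", "B", "D", "A", "C"]
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ holds the sorted unique labels; a label's code is its position here.
print(le.classes_)   # ['A' 'B' 'C' 'D']
print(y_encoded)     # [0 1 3 0 2]

# inverse_transform turns codes (e.g. model predictions) back into labels.
print(le.inverse_transform([3, 0, 2]))  # ['D' 'A' 'C']
```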

I modified the sample code that sklearn provides:

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import make_classification

# Build a toy feature matrix; the generated labels are replaced below.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
# Overwrite y with random string labels for the four classes.
y = np.random.choice(["A", "B", "C", "D"], 1000)
print(y.shape)  # (1000,)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
clf.classes_
# array(['A', 'B', 'C', 'D'], dtype='<U1')

RandomForestClassifier works either way, whether or not y is label encoded.
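One way to see that string labels are handled directly: after fitting, `classes_` matches the sorted unique values of y, and predictions come back as those same strings. A minimal sketch with made-up data (the seed and sample size are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, _ = make_classification(n_samples=200, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0)
rng = np.random.default_rng(0)
y = rng.choice(["A", "B", "C", "D"], size=200)

clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y)

# The fitted classes are the sorted unique labels found in y.
print(np.array_equal(clf.classes_, np.unique(y)))

# predict returns the original string labels, not integer codes.
pred = clf.predict(X[:5])
print(pred.dtype.kind)  # 'U' means a unicode string array

# predict_proba has one column per class, in the order of classes_.
print(clf.predict_proba(X[:5]).shape)  # (5, 4)
```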

But then why does the approach with ['A' , 'B', ...] work at all? Does sklearn encode Strings automatically? — Matze, Apr 14 '20 at 07:42

Shilpa Shinde · Answer 3 · 2021-03-12T10:08:06.100

You don't need to encode the labels here; you can keep them as they are, whether strings or numbers. As far as I know, it is when using a neural network that you should consider one-hot encoding or label encoding. As an example, on the BBC classification data the prediction comes back as a string label:

model.predict(sample_data)

array(['entertainment'], dtype='<U13')

Text (categorical) columns in the training features, on the other hand, do have to be converted to numbers, typically with one-hot encoding. For example:

    name         fuel type
    baleno       petrol
    MG hector    electric

After one-hot encoding:

    name         fuel type_petrol    fuel type_electric
    baleno       1                   0
    MG hector    0                   1
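A table like this can be produced with pandas. A rough sketch using `pandas.get_dummies` on the two example rows; only the `fuel type` column is encoded, and `dtype=int` is passed so the result holds 0/1 rather than booleans. Note that pandas orders the new columns alphabetically by category, so `electric` comes before `petrol`.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["baleno", "MG hector"],
    "fuel type": ["petrol", "electric"],
})

# One-hot encode only the categorical feature column.
encoded = pd.get_dummies(df, columns=["fuel type"], dtype=int)
print(encoded.columns.tolist())
# ['name', 'fuel type_electric', 'fuel type_petrol']
```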
