
I'm trying to fit a decision tree classifier with scikit-learn. I have 5 features, and one of them is categorical rather than numerical.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv()  # path omitted
labelEncoders = {}
for column in df.dtypes[df.dtypes == 'object'].index:
    labelEncoders[column] = LabelEncoder()
    df[column] = labelEncoders[column].fit_transform(df[column])
    print(labelEncoders[column].inverse_transform([0, 1, 2]))  # ['High', 'Low', 'Normal']

I'm new to ML, and I've been reading about the need to encode categorical features before feeding the data frame to the model, and about encoding variants like label encoding and one-hot encoding.

Now, according to most literature, label encoding should or could be used when the values of the feature can be naturally ordered, for instance 'Low', 'Normal', 'High'. Otherwise one should use one-hot encoding, so the model doesn't infer a misleading order relationship between values that have no meaningful order, for example 'Brazil', 'Congo', 'Czech Republic'.
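To illustrate what I mean (toy values, just pandas, no scikit-learn): an ordered encoding can be expressed with `pd.Categorical` by listing the categories in their natural order, while `pd.get_dummies` produces one-hot columns with no implied order.

```python
import pandas as pd

s = pd.Series(["Low", "High", "Normal", "Low"])

# Ordinal encoding with an explicit order: Low < Normal < High
ordered = pd.Categorical(s, categories=["Low", "Normal", "High"], ordered=True)
print(ordered.codes)  # [0 2 1 0]

# One-hot encoding for unordered categories: one indicator column per value
print(pd.get_dummies(s))
```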

So, that's where I'm at with the logic behind choosing a coding strategy, and that's why I'm asking this:

how can I make scikit-learn's LabelEncoder keep the natural order of the values, i.e. encode like this:

Low -> 0
Normal -> 1
High -> 2

and NOT the way it's doing it now:

High -> 0
Low -> 1
Normal -> 2

Can this be done at all? Is it actually the encoder's task? Do I have to do it somewhere else before the encoding?

Thanks

Scaramouche
  • The "misleading" relationship will not harm the classifier. The classifier will learn the relationships and does not care if it is "your right order"...no need at all to adjust it – PV8 Nov 27 '19 at 09:16
  • @PV8 OP's statement is supported by the ```scikit-learn``` docs, see "5.3.4. Encoding categorical features" here: https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features – nheise Nov 27 '19 at 09:45
  • For decision tree classifier this is not valid....if we talk about regression I agree, but not with a decision tree – PV8 Nov 27 '19 at 09:50

1 Answer


You can use pandas.DataFrame.replace() to explicitly pass in the encodings you want to use. As an example:

import pandas as pd

df = pd.DataFrame(data={
    "ID": [1, 2, 3, 4, 5],
    "Label": ["Low", "High", "Low", "High", "Normal"],
})

print("Original:")
print(df)

label_mapping = {"Low": 0, "Normal": 1, "High": 2}
df = df.replace({"Label": label_mapping})

print("Mapped:")
print(df)

Output:

Original:
   ID   Label
0   1     Low
1   2    High
2   3     Low
3   4    High
4   5  Normal
Mapped:
   ID  Label
0   1      0
1   2      2
2   3      0
3   4      2
4   5      1
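If you'd rather stay within scikit-learn, OrdinalEncoder accepts an explicit category order via its categories parameter (unlike LabelEncoder, which is intended for target labels and always sorts alphabetically). A sketch with the same toy data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(data={
    "ID": [1, 2, 3, 4, 5],
    "Label": ["Low", "High", "Low", "High", "Normal"],
})

# categories lists the values in the order they should be encoded:
# Low -> 0, Normal -> 1, High -> 2
encoder = OrdinalEncoder(categories=[["Low", "Normal", "High"]])

# OrdinalEncoder works on 2-D input, hence df[["Label"]] rather than df["Label"]
df["Label"] = encoder.fit_transform(df[["Label"]])
print(df)
```

This keeps the mapping inside a fitted transformer, so it can be reused on new data or dropped into a Pipeline.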
nheise
  • just to understand, what you are proposing is that I encode them NOT using `LabelEncoder`, and just doing it kind of manually? I mean, I'm ok with that if it works but, what's the use of the `LabelEncoder` then? when does it become viable to use it? – Scaramouche Nov 27 '19 at 23:28
  • Yes that's what I'm suggesting. Somehow you have to provide the mapping in the case of ordinal labels, or else how can the encoder know that ```low < normal < high```? You can always try it once with ```LabelEncoder``` and once with ```DataFrame.replace()``` and see how your results differ. ```LabelEncoder``` is just another tool that you can choose to use or not use. – nheise Nov 28 '19 at 06:41