I am trying to use category_encoders.TargetEncoder to encode a categorical feature. My target variable is a continuous number. However, the output from the target encoder is very strange and I could not interpret it. Could someone give me a hint on what is happening?

Here is my toy code.

import pandas as pd
from category_encoders import TargetEncoder

df = pd.DataFrame(['A', 'B', 'C', 'D', 'E', 'F', 'F', 'F', 'G', 'G', 'G'], columns=['cat'])
df['target'] = [921, 921, 3.5, 280, 0, 3.5, 3.5, 3.5, 200, 200, 200]

Now df looks like:

    cat target
0   A   921.0
1   B   921.0
2   C   3.5
3   D   280.0
4   E   0.0
5   F   3.5
6   F   3.5
7   F   3.5
8   G   200.0
9   G   200.0
10  G   200.0

Then I ran the encoder as:

encoder = TargetEncoder()
df['encoded'] = encoder.fit_transform(df["cat"], df['target'])

and here is my output:

    cat target  encoded
0   A   921.0   248.727273
1   B   921.0   248.727273
2   C   3.5     248.727273
3   D   280.0   248.727273
4   E   0.0     248.727273
5   F   3.5     32.731807
6   F   3.5     32.731807
7   F   3.5     32.731807
8   G   200.0   205.808433
9   G   200.0   205.808433
10  G   200.0   205.808433

What I don't understand is that, for categories that contain only one row (categories 'A' to 'E'), the encoder returns the same value regardless of each row's target. Is that by design?
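For reference, the encoded values above can be reproduced by hand, which helps with interpreting them. The sketch below is an assumption inferred from the numbers in the output, not something read from the library source: each category's mean is blended with the global target mean using a sigmoid weight (with assumed defaults min_samples_leaf=1 and smoothing=1.0), and any category with a single row falls back to the global mean. Newer versions of category_encoders may use different defaults, so treat this as an illustration only.

```python
import math
from collections import defaultdict

cats = ['A', 'B', 'C', 'D', 'E', 'F', 'F', 'F', 'G', 'G', 'G']
targets = [921, 921, 3.5, 280, 0, 3.5, 3.5, 3.5, 200, 200, 200]

# Assumed defaults, inferred from the output in the question.
MIN_SAMPLES_LEAF = 1
SMOOTHING = 1.0

# Global target mean; this is the value all single-row categories received.
prior = sum(targets) / len(targets)

# Collect the target values of each category.
groups = defaultdict(list)
for c, t in zip(cats, targets):
    groups[c].append(t)

def encode(values):
    """Blend the category mean with the global mean (assumed formula)."""
    n = len(values)
    if n == 1:
        # Assumption: a single observation is treated as uninformative.
        return prior
    mean = sum(values) / n
    # Sigmoid weight: the more rows a category has, the more its own mean counts.
    weight = 1 / (1 + math.exp(-(n - MIN_SAMPLES_LEAF) / SMOOTHING))
    return prior * (1 - weight) + mean * weight

encoding = {c: encode(v) for c, v in groups.items()}
for c in sorted(encoding):
    print(c, round(encoding[c], 6))
```

Under these assumptions, 'A' to 'E' come out as the global mean (248.727273), and 'F' and 'G' come out between their own mean and the global mean (32.731807 and 205.808433), which matches the encoded column in the question.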

Yue Y

1 Answer

You should have used OrdinalEncoder instead. Here is how to do this:

import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

df = pd.DataFrame(['A', 'B', 'C', 'D', 'E', 'F', 'F', 'F', 'G', 'G', 'G'], columns=['cat'])
df['target'] = [921, 921, 3.5, 280, 0, 3.5, 3.5, 3.5, 200, 200, 200]

Now, you need to use LabelEncoder like this:

le = LabelEncoder()
encoded = le.fit_transform(np.ravel(df['target']))

and finally do what you wanted:

ce_ord = ce.OrdinalEncoder(cols = ['cat'])
df['encoded_cat'] = ce_ord.fit_transform(df['cat'],df['target'])

which returns

    cat  target  encoded_cat
0    A   921.0            1
1    B   921.0            2
2    C     3.5            3
3    D   280.0            4
4    E     0.0            5
5    F     3.5            6
6    F     3.5            6
7    F     3.5            6
8    G   200.0            7
9    G   200.0            7
10   G   200.0            7
  • It seems that the output categories from your example are not related to the target value at all? It is basically mapping the letter-cat to the numerical-cat? – Yue Y Jul 07 '21 at 20:59
  • It creates a category for each unique combination of cat and target, just as a category encoder should. – Serge de Gosson de Varennes Jul 08 '21 at 06:07
  • I guess I am looking more for something similar to the weight of evidence encoder, where I want my encoded feature to be somewhat related to the probability of labels in that category. – Yue Y Jul 08 '21 at 23:41