I have a Pandas DataFrame, train, that I'm one-hot encoding. It looks something like this:

    car
0   Mazda
1   BMW
2   Honda

If I use pd.get_dummies, I'll get this:

    car_BMW  car_Honda  car_Mazda
0   0        0          1
1   1        0          0
2   0        1          0

All good so far.

However, I don't have access to my test set, so I need to handle the possibility that test contains a value for car that wasn't seen in train.

Suppose test is this:

    car
0   Mazda
1   Audi

Then if I use pd.get_dummies on test, I get:

    car_Audi  car_Mazda
0   0         1
1   1         0

This is wrong: I get a new column, car_Audi, and I'm missing car_BMW and car_Honda.

I'd like the output of one-hot encoding test to be:

    car_BMW  car_Honda  car_Mazda
0   0        0          1
1   0        0          0

So it just ignores previously unseen values in test. I definitely don't want to create new columns for previously unseen values in test.
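For reference, here is a minimal pandas-only sketch of the output described above. It is not the LabelBinarizer route discussed further down. It encodes test with pd.get_dummies and then reindexes to the columns seen in train. The names train_columns and test_encoded are made up for illustration, and dtype=int is only there so the result prints as 0/1.

```python
import pandas as pd

train = pd.DataFrame({"car": ["Mazda", "BMW", "Honda"]})
test = pd.DataFrame({"car": ["Mazda", "Audi"]})

# Encode train and remember which dummy columns it produced.
train_encoded = pd.get_dummies(train, columns=["car"], dtype=int)
train_columns = train_encoded.columns  # illustrative name

# Encode test, then align it to train's columns: columns not seen in
# train (car_Audi) are dropped, and columns missing from test
# (car_BMW, car_Honda) are added and filled with 0.
test_encoded = pd.get_dummies(test, columns=["car"], dtype=int).reindex(
    columns=train_columns, fill_value=0
)
print(test_encoded)
```

The row for Audi comes out as all zeros, so an unseen value is effectively ignored rather than given its own column.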

I've looked into sklearn.preprocessing.LabelBinarizer, but it outputs a NumPy array and it isn't clear which column corresponds to which value:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
train_transformed = lb.fit_transform(train_df)

gives me back:

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]])

Any ideas here?

Thanks!

cs95
anon_swe

1 Answer

This isn't a hard problem to solve. After fitting, LabelBinarizer has an attribute, classes_, that you can query to see which label each output column corresponds to:

train_transformed = lb.fit_transform(df)

print(train_transformed)
[[0 0 1]
 [1 0 0]
 [0 1 0]]

print(lb.classes_)
['BMW' 'Honda' 'Mazda']
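Building on classes_, here is a rough sketch of how test could be encoded with the binarizer fitted on train and wrapped back into a DataFrame. The encode helper and the car_ column prefix are my own additions for illustration, not part of scikit-learn. It also assumes train has at least three distinct values: with exactly two classes, LabelBinarizer returns a single column, so the column names would no longer line up.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

train = pd.DataFrame({"car": ["Mazda", "BMW", "Honda"]})
test = pd.DataFrame({"car": ["Mazda", "Audi"]})

# Fit on train only, so classes_ holds exactly the labels seen in train.
lb = LabelBinarizer()
lb.fit(train["car"].to_numpy())


def encode(df):
    """Hypothetical helper: one-hot encode df['car'] using the fitted lb."""
    encoded = lb.transform(df["car"].to_numpy())
    # classes_ is sorted and matches the column order of the output array.
    columns = [f"car_{label}" for label in lb.classes_]
    return pd.DataFrame(encoded, columns=columns, index=df.index)


test_encoded = encode(test)
print(test_encoded)
```

Labels that were not seen during fit (Audi here) are encoded as a row of zeros, which matches the behaviour asked for in the question.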
cs95
    @COLDSPEED Thanks. So I guess I can just do `temp = lb.transform(test_df); return pd.DataFrame(temp, columns=lb.classes_)`? – anon_swe May 02 '18 at 16:13