
I am one-hot encoding a multi-label, multi-class dataset for the training and test sets as follows:

In `metadata` there are unique strings that correspond to the individual classes, e.g. 'green', 'red', 'square', etc.:

onehot = OneHotEncoder(sparse=False).fit_transform(metadata[['color', 'shape']])

This gives `onehot` as an array of shape (samples, categories), with 1s and 0s according to the classes.

This is then split etc etc...

Later on, after training, I might get a new set of samples, e.g. a data frame of 10 samples containing, say, only greens and only squares, not the full range of categories from the original training data.

I modified the original training code as follows:

onehot = OneHotEncoder(sparse=False)
onehotarray = onehot.fit_transform(metadata[['color', 'shape']])

The problem is now when I apply this to the new dataset:

onehotnew = onehot.fit_transform(newdata[['color', 'shape']])

The array is now (number of samples, number of categories present in the new data ONLY), so instead of, for example, (x, 10), it is now shape (x, 4) (because the only categories present are green and square).

Is there a way of preserving the original shape of the one-hot encoding, with all classes? At the moment this is useless when doing something like a confusion matrix call, since the one-hot encoding only contains columns for the categories present in the new data.

Thanks!

Answer (as per the comment below): removing the `fit` on the new data fixed the problem:

onehotnew = onehot.transform(newdata[['color', 'shape']])
  • Just use the same encoder object, and only `transform` (**not fit**) the test set. See https://scikit-learn.org/stable/common_pitfalls.html#inconsistent-preprocessing – Ben Reiniger Feb 07 '21 at 21:24
  • Thank you! That works perfectly... Is it still OK to `fit_transform` in one go on the original data as per my post? `onehot = OneHotEncoder(sparse=False); onehotarray = onehot.fit_transform(metadata[['color', 'shape']])` and then simply use `onehot.transform(testmetadata[['color', 'shape']])` for all unseen data? I.e., it doesn't have to be 3 lines (`oh = OneHotEncoder...`, `oh.fit(x)`, `oh.transform(etc)`)... the 2-liner above seems more efficient. –  Feb 07 '21 at 23:03
  • Yes, `fit_transform` generally has the effect of (and often even is implemented as) running `fit` and then `transform`. – Ben Reiniger Feb 08 '21 at 14:56
  • Thank you! Not sure how to accept your first comment as the official answer, but I will go ahead and answer this question :) –  Feb 08 '21 at 15:37
  • Does this answer your question? [predicitng new value through a model trained on one hot encoded data](https://stackoverflow.com/questions/56133664/predicitng-new-value-through-a-model-trained-on-one-hot-encoded-data). Also [over at DS.SE](https://datascience.stackexchange.com/q/54052/55122). – Ben Reiniger Feb 08 '21 at 15:43
  • Yep! Thank you, that is the same as the answer above. I couldn't find any threads when searching, so thank you for linking. –  Feb 08 '21 at 22:24
