I am one-hot encoding a multi-label, multi-class dataset for the training and test sets as follows:
The columns in metadata hold unique strings that correspond to individual classes, e.g. 'green', 'red', 'square', etc.
onehot = OneHotEncoder(sparse=False).fit_transform(metadata[['color', 'shape']])
This gives an array of shape (samples, categories) with 1's and 0's as per the classes.
This is then split into training and test sets, etc.
Later on, after training, I might get a new set of samples, e.g. a data frame of 10 samples, where there are only greens and only squares, not the full range of categories from the original training data.
I modified the original training code as follows:
onehot = OneHotEncoder(sparse=False)
onehotarray = onehot.fit_transform(metadata[['color', 'shape']])
The problem is now when I apply this to the new dataset:
onehotnew = onehot.fit_transform(newdata[['color', 'shape']])
The array is now (number of samples) by ONLY the number of categories present in the new data, so instead of shape (x, 10), for example, it is now shape (x, 4) (because the only categories present are green and square).
Is there a way of preserving the original shape of the one-hot encoding, with all classes? At the moment this is useless for something like a confusion matrix call, because the encoding only contains columns for the categories present in the new data.
Thanks.
Answer: as per the comment below, removing the fit on the new data fixed the problem, i.e. calling transform on the already-fitted encoder instead of fit_transform:
onehotnew = onehot.transform(newdata[['color', 'shape']])
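
For anyone hitting the same thing, here is a minimal sketch of the difference. The two small data frames are made-up stand-ins for metadata and newdata (3 colors and 3 shapes, so 6 columns rather than the 10 in my real data), and I call .toarray() on the default sparse output instead of passing sparse=False, only because that keyword has been renamed in newer scikit-learn releases.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the original training metadata
metadata = pd.DataFrame({
    "color": ["green", "red", "blue", "green"],
    "shape": ["square", "circle", "triangle", "circle"],
})

# Hypothetical new samples that contain only a subset of the categories
newdata = pd.DataFrame({
    "color": ["green", "green"],
    "shape": ["square", "square"],
})

# Fit once on the training data; the encoder remembers every category it saw
onehot = OneHotEncoder()
onehotarray = onehot.fit_transform(metadata[["color", "shape"]]).toarray()

# transform() reuses the fitted categories, so the column count is preserved
onehotnew = onehot.transform(newdata[["color", "shape"]]).toarray()

# Refitting on the new data re-learns the categories and shrinks the columns
refit = OneHotEncoder().fit_transform(newdata[["color", "shape"]]).toarray()

print(onehot.categories_)  # categories learned from the training data
print(onehotarray.shape)   # (4, 6)
print(onehotnew.shape)     # (2, 6)
print(refit.shape)         # (2, 2)
```

If the new data might also contain categories that were never seen during fitting, constructing the encoder with handle_unknown="ignore" should encode those as all zeros rather than raising an error, though I have not needed that here.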