While preprocessing the labels for a machine learning classification task, I need to one-hot encode labels that take string values. It turns out that OneHotEncoder from sklearn.preprocessing and to_categorical from keras.utils.np_utils both require int inputs. This means that I need to precede the one-hot encoder with a LabelEncoder. I have done it by hand with a custom class:
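from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()

    def fit_transform(self, x):
        # Map string labels to integers, then one-hot encode them as a column
        features = self.le.fit_transform(x)
        return self.ohe.fit_transform(features.reshape(-1, 1))

    def transform(self, x):
        # Encode first, then reshape the integer codes into a column
        return self.ohe.transform(self.le.transform(x).reshape(-1, 1))

    def inverse_transform(self, x):
        # Undo the one-hot step, flatten to 1-D, then recover the string labels
        return self.le.inverse_transform(self.ohe.inverse_transform(x).ravel())

    def inverse_labels(self, x):
        # Map integer codes back to the original string labels
        return self.le.inverse_transform(x)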
I am confident there must be a way of doing this within the sklearn API using sklearn.pipeline, but when using:
LabelOneHotEncoder = Pipeline( [ ("le",LabelEncoder), ("ohe", OneHotEncoder)])
I get the error ValueError: bad input shape () from the OneHotEncoder. My guess is that the output of the LabelEncoder needs to be reshaped by adding a trivial second axis. I am not sure how to add this step, though.
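To make the guess concrete, here is a minimal sketch of the reshaping step done by hand, outside of any pipeline. The labels array is made up for illustration, and this only shows the shapes involved; it is not a claim about how a Pipeline should be configured.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical example labels, not taken from my real data
labels = np.array(["cat", "dog", "bird", "dog"])

# LabelEncoder returns a 1-D integer array of shape (n_samples,)
codes = LabelEncoder().fit_transform(labels)

# OneHotEncoder expects a 2-D array of shape (n_samples, n_features),
# so a trivial second axis is added before encoding
codes_2d = codes.reshape(-1, 1)

# The default output is a sparse matrix; convert it to a dense array to inspect it
one_hot = OneHotEncoder().fit_transform(codes_2d).toarray()

print(codes.shape, codes_2d.shape, one_hot.shape)
```

What I am missing is how to express the codes.reshape(-1, 1) step as part of a pipeline rather than as a separate manual call.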