
While preprocessing the labels for a machine learning classification task, I need to one-hot encode labels that take string values. It happens that OneHotEncoder from sklearn.preprocessing and to_categorical from keras.utils.np_utils require int inputs. This means that I need to precede the one-hot encoder with a LabelEncoder. I have done it by hand with a custom class:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()
    def fit_transform(self, x):
        # encode strings to ints, then one-hot encode the resulting column vector
        features = self.le.fit_transform(x)
        return self.ohe.fit_transform(features.reshape(-1, 1))
    def transform(self, x):
        return self.ohe.transform(self.le.transform(x).reshape(-1, 1))
    def inverse_transform(self, x):
        # OneHotEncoder.inverse_transform returns a (n, 1) array, so flatten it
        return self.le.inverse_transform(self.ohe.inverse_transform(x).ravel())
    def inverse_labels(self, x):
        return self.le.inverse_transform(x)
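
For reference, I use it like this (a minimal sketch; the example labels are made up):

import numpy as np

lohe = LabelOneHotEncoder()
encoded = lohe.fit_transform(np.array(['dog', 'cat', 'dog']))  # sparse one-hot matrix, shape (3, 2)
lohe.inverse_labels([0, 1])  # array(['cat', 'dog'], dtype='<U3')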

I am confident there must be a way of doing this within the sklearn API using a sklearn.pipeline, but when using:

LabelOneHotEncoder = Pipeline( [ ("le",LabelEncoder), ("ohe", OneHotEncoder)])

I get the error `ValueError: bad input shape ()` from the OneHotEncoder. My guess is that the output of the LabelEncoder needs to be reshaped by adding a trivial second axis, but I am not sure how to add this step.
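
For illustration, this is the kind of reshape I have in mind (a minimal sketch with made-up labels):

import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels = le.fit_transform(np.array(['dog', 'cat', 'dog']))  # array([1, 0, 1])
labels.reshape(-1, 1)  # adds the trivial second axis: [[1], [0], [1]]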

Learning is a mess

4 Answers


It's strange that they don't play together nicely... I'm surprised. I'd extend the class to return the reshaped data like you suggested.

from sklearn.preprocessing import LabelEncoder

class ModifiedLabelEncoder(LabelEncoder):
    """A LabelEncoder whose output has the 2D shape that OneHotEncoder expects."""

    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)

Then using the pipeline should work.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])
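
Note that OneHotEncoder returns a sparse matrix by default, so call `.toarray()` on the result if you want to inspect a dense array.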

https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/label.py#L39

David Stevens
  • thanks for the input. I first accepted your answer as it looked obvious it would work, but I am facing bugs when implementing it. First the pipeline constructor takes classes and not instances, so it must be `ModifiedLabelEncoder` and not `ModifiedLabelEncoder()`. Second, the reshape argument should be `(-1,1)`. After that the `ModifiedLabelEncoder` works on its own, but not in the pipeline. I get a `TypeError: super(type, obj): obj must be an instance or subtype of type` when calling `fit_transform`. – Learning is a mess Feb 22 '18 at 15:02
    @Learningisamess Wrong. Pipeline takes instances, not classes. – Vivek Kumar Feb 22 '18 at 15:08
  • @Learningisamess I have corrected the code above, please check now. If still error, post the complete stack trace of error in the question. – Vivek Kumar Feb 22 '18 at 15:16
  • @VivekKumar: Indeed I was wrong on the instances vs class about the pipeline. Will check the new code soon and report. – Learning is a mess Feb 22 '18 at 15:36
  • The code does work after adding support for extra arguments (`*args, **kwargs`). Accepting the answer, thank you! – Learning is a mess Feb 22 '18 at 22:12

From scikit-learn 0.20 onwards, OneHotEncoder accepts strings, so you don't need a LabelEncoder before it anymore, and you can just use it in a pipeline.
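
For example (a minimal sketch, assuming scikit-learn >= 0.20; the labels are made up):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array(['dog', 'cat', 'dog']).reshape(-1, 1)  # OneHotEncoder expects 2D input
ohe = OneHotEncoder()
encoded = ohe.fit_transform(labels)  # sparse matrix; categories are sorted: ['cat', 'dog']
print(encoded.toarray())
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
print(ohe.inverse_transform(encoded))  # recovers the original string labels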

bryant1410

I have used a customized class to wrap my label encoder function; it returns the whole updated dataset.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
import pandas as pd

class CustomLabelEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        le = LabelEncoder()
        for i in X[cat_cols]:  # iterating a DataFrame yields its column names
            X[i] = le.fit_transform(X[i])
        return X

cat_cols = ['Family', 'Education', 'Securities Account', 'CDAccount', 'Online', 'CreditCard']
le_ct = make_column_transformer((CustomLabelEncode(), cat_cols), remainder='passthrough')
pd.DataFrame(le_ct.fit_transform(X))  # this will show you your changes
Final_pipeline = make_pipeline(le_ct)

I have implemented it; you can see it on my GitHub: https://github.com/Ayushmina-20/sklearn_pipeline


It is not for the asked question, but if you want to apply only a LabelEncoder to all non-numeric columns, you can use the format below:

from sklearn.preprocessing import LabelEncoder

df_non_numeric = df.select_dtypes(['object'])  # select the string-valued columns
non_numeric_cols = df_non_numeric.columns.values
for col in non_numeric_cols:
    df[col] = LabelEncoder().fit_transform(df[col].values)
df.head()
PlutoSenthil