9

I am implementing a pre-processing pipeline using sklearn's pipeline transformers. My pipeline includes sklearn's KNNImputer estimator, which I want to use to impute categorical features in my dataset. (My question is similar to this thread, but that thread doesn't answer it: How to implement KNN to impute categorical features in a sklearn pipeline)

I know that the categorical features have to be encoded before imputation, and this is where I am having trouble. With standard label/ordinal/onehot encoders, trying to encode categorical features with missing values (np.nan) gives the following error:

ValueError: Input contains NaN
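
For reference, here is a minimal toy snippet (not my real data) that reproduces this kind of failure; note that whether the error is raised at all depends on the scikit-learn version, since newer releases of these encoders can handle NaN natively:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'colour': ['red', np.nan, 'blue']})

# With older scikit-learn releases this raises an error on the NaN;
# newer releases can pass the NaN through or encode it
# (see OrdinalEncoder's encoded_missing_value parameter)
OrdinalEncoder().fit_transform(X)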

I've managed to "bypass" that by creating a custom encoder where I replace the np.nan with 'Missing':

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OrdinalEncoder


class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = None

    def fit(self, X, y=None):
        self.encoder = OrdinalEncoder()
        # Filling NaN with a 'Missing' placeholder lets the encoder accept the column
        self.encoder.fit(X.fillna('Missing'))
        return self

    def transform(self, X, y=None):
        return self.encoder.transform(X.fillna('Missing'))

    def fit_transform(self, X, y=None, **fit_params):
        self.encoder = OrdinalEncoder()
        return self.encoder.fit_transform(X.fillna('Missing'))

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

preprocessor = ColumnTransformer([
    ('categoricals', CustomEncoder(), cat_features),
    ('numericals', StandardScaler(), num_features)],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('imputing', KNNImputer(n_neighbors=5))
])

In this scenario, however, I cannot find a reasonable way to set the encoded 'Missing' values back to np.nan before imputing with the KNNImputer.

I've read on this thread that I could do this manually using the OneHotEncoder transformer: Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn. But again, I'd like to implement all of this in a pipeline to automate the entire pre-processing phase.

Has anyone managed to do this? Would anyone recommend an alternative solution? Is imputing with a KNN algorithm maybe not worth the trouble and should I use a simple imputer instead?

Thanks in advance for your feedback!

LazyEval
  • As a sort of followup to the second linked thread, there's a stab at a pipeline-able transformer at https://stackoverflow.com/q/66635031/10495893 – Ben Reiniger Mar 29 '21 at 15:59

2 Answers

21

I am afraid that this cannot work. If you one-hot encode your categorical data, your missing values will be encoded into a new binary variable and KNNImputer will fail to deal with them because:

  • it works on one column at a time, not on the full set of one-hot encoded columns
  • there won't be any missing values left to deal with anymore
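
To see the second point concretely, here is a toy illustration (made-up data) using pandas.get_dummies for the one-hot encoding: the row with the missing category simply becomes all zeros in the dummy columns (or gets its own indicator column with dummy_na=True), so the imputer never sees a missing value there:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1.0, 6.0, np.nan]})

# One-hot encode 'A'; by default get_dummies drops the NaN, leaving an
# all-zero row in the dummy columns instead of a missing value
encoded = pd.get_dummies(df, columns=['A'])
print(encoded)

# Only the NaN in 'B' gets imputed; the missing category is invisible to the imputer
print(KNNImputer(n_neighbors=1).fit_transform(encoded))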

Anyway, you have a few options for imputing missing categorical variables using scikit-learn:

  1. you can use sklearn.impute.SimpleImputer using strategy="most_frequent": this will replace missing values with the most frequent value along each column, whether the columns contain strings or numeric data (a short sketch follows after this list)
  2. use sklearn.impute.KNNImputer, with some limitations: you first have to transform your categorical features into numeric ones while preserving the NaN values (see: LabelEncoder that keeps missing values as 'NaN'), then you can use the KNNImputer with only the nearest neighbour as replacement (if you use more than one neighbour, it will produce a meaningless average). For example:
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.impute import KNNImputer
    
    df = pd.DataFrame({'A': ['x', np.NaN, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.NaN]})
    
    df = df.apply(lambda series: pd.Series(
        LabelEncoder().fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    ))
    
    imputer = KNNImputer(n_neighbors=1)
    imputer.fit_transform(df)
    
    In:
        A   B   C
    0   x   1   2.0
    1   NaN 6   1.0
    2   z   9   NaN
    
    Out:
    array([[0., 0., 1.],
           [0., 1., 0.],
           [1., 2., 0.]])
  3. use sklearn.impute.IterativeImputer and replicate a MissForest imputer for mixed data (but you will have to process numeric and categorical features separately). For example:
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    
    df = pd.DataFrame({'A': ['x', np.NaN, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.NaN]})
    
    categorical = ['A']
    numerical = ['B', 'C']
    
    df[categorical] = df[categorical].apply(lambda series: pd.Series(
        LabelEncoder().fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    ))
    
    print(df)
    
    imp_num = IterativeImputer(estimator=RandomForestRegressor(),
                               initial_strategy='mean',
                               max_iter=10, random_state=0)
    imp_cat = IterativeImputer(estimator=RandomForestClassifier(), 
                               initial_strategy='most_frequent',
                               max_iter=10, random_state=0)
    
    df[numerical] = imp_num.fit_transform(df[numerical])
    df[categorical] = imp_cat.fit_transform(df[categorical])
    
    print(df)
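
As promised above, here is a quick sketch of option 1 with a made-up toy frame: SimpleImputer with strategy="most_frequent" accepts string columns directly, so no encoding step is needed:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': ['x', 'x', np.nan, 'z'], 'B': [1, 6, 6, np.nan]})

# most_frequent works for string and numeric columns alike:
# the NaN in 'A' becomes 'x' and the NaN in 'B' becomes 6
print(SimpleImputer(strategy='most_frequent').fit_transform(df))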
Luca Massaron
  • BTW, if you are looking to implement all this into a Scikit-learn pipeline, you can have a look at my pipelining for deep learning for tabular data : https://github.com/lmassaron/deep_learning_for_tabular_data I think the class LEncoder is what you are looking for :-) – Luca Massaron Nov 19 '20 at 15:36
  • Thanks for your reply and link Luca. Yes, I was looking to implement solution 2) you mention above using an OrdinalEncoder. My idea is that a KNN imputation would give me better results than a SimpleImpute but I am not sure how to evaluate that really. – LazyEval Nov 20 '20 at 16:43
  • There is also a third approach, based on an experimental function in Scikit-learn: IterativeImputer, which can replicate MissForest (see: https://academic.oup.com/bioinformatics/article/28/1/112/219101), an approach able to deal with both numeric and categorical missing values. I've added it as an edit to the answer. – Luca Massaron Nov 20 '20 at 17:25
  • The MissForest approach is not only able to deal with mixed type variables, it is also more reliable in imputation, both in the case of missing at random (MAR) and missing not at random (MNAR) which is often the more common case using quasi-experimental data. – Luca Massaron Nov 20 '20 at 17:35
  • Very interesting. I think I should indeed explore solution 3). I haven't heard of it though to be honest, is it not a common imputation strategy? Are there cons to this approach? I suppose that it is slower. – LazyEval Nov 20 '20 at 17:39
  • It is surprisingly effective, even in data fusion problems (when the missingness is on entire blocks of cases and variables). Anyway it is clearly computationally intensive and sometimes it is better to apply the fit on a sample of your data, not on the full dataset. – Luca Massaron Nov 20 '20 at 17:57
  • @LucaMassaron Interesting read! Is there a python package for MissForest imputation available? – spectre Nov 23 '21 at 17:22
  • You can use IterativeImputer from Scikit-learn using ExtraTreesRegressor as an estimator which is similar to missForest in R (see: https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html) – Luca Massaron Nov 24 '21 at 20:35
  • @LucaMassaron ```df[numerical] = imp_num.fit_transform(df[numerical])``` this line seems to run forever for me. Any idea why this could be happening? Happy to share code if this would help. – Farwent Jan 03 '22 at 22:34
  • Depending on the size of the dataset and the number of features involved, iterative fitting may take a long time. Try using all your processors for the job by setting n_jobs=-1 for both RandomForestRegressor and RandomForestClassifier and you may try to patch Scikit-Learn with Intel(R) Extension for Scikit-learn as described here: https://pypi.org/project/scikit-learn-intelex/ – Luca Massaron Jan 14 '22 at 06:09
1

For anyone interested, I managed to implement a custom label encoder that ignores np.nan and is compatible with the sklearn pipeline transformers, similar to Luca Massaron's LEncoder implemented in his GitHub repo: https://github.com/lmassaron/deep_learning_for_tabular_data

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoders = dict()

    def fit(self, X, y=None):
        # Fit one LabelEncoder per column, using only the non-missing values
        for col in X.columns:
            le = LabelEncoder()
            le.fit(X.loc[X[col].notna(), col])
            le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

            # Set unknown to a new value so transform on the test set handles unknown categories
            max_value = max(le_dict.values())
            le_dict['_unk'] = max_value + 1

            self.encoders[col] = le_dict
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        for col in X.columns:
            le_dict = self.encoders[col]
            # Encode only the non-missing entries; np.nan values are left untouched
            # so a downstream imputer (e.g. KNNImputer) can still see them
            X.loc[X[col].notna(), col] = X.loc[X[col].notna(), col].apply(
                lambda x: le_dict.get(x, le_dict['_unk'])).values
        return X

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y)
        return self.transform(X, y)
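
As a usage sketch (with made-up toy data and column lists, reusing the CustomEncoder above), the encoder slots into the original ColumnTransformer/KNNImputer pipeline; because the np.nan values survive the encoding, the imputer can fill them in:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

# Hypothetical toy data and column lists
df = pd.DataFrame({'colour': ['red', np.nan, 'blue', 'red'],
                   'size': [1.0, 2.0, np.nan, 4.0]})
cat_features = ['colour']
num_features = ['size']

preprocessor = ColumnTransformer([
    ('categoricals', CustomEncoder(), cat_features),
    ('numericals', StandardScaler(), num_features)],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('imputing', KNNImputer(n_neighbors=1))
])

# Both the missing category in 'colour' and the missing number in 'size' get imputed
print(pipeline.fit_transform(df))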
LazyEval