I am implementing a pre-processing pipeline using sklearn's pipeline transformers. My pipeline includes sklearn's KNNImputer estimator that I want to use to impute categorical features in my dataset. (My question is similar to this thread but it doesn't contain the answer to my question: How to implement KNN to impute categorical features in a sklearn pipeline)
I know that the categorical features have to be encoded before imputation, and this is where I am having trouble. With the standard label/ordinal/one-hot encoders, trying to encode categorical features that contain missing values (np.nan) raises the following error:
ValueError: Input contains NaN
I've managed to bypass that by creating a custom encoder in which I replace np.nan with 'Missing':
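from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = None

    def fit(self, X, y=None):
        # Replace NaN with a placeholder category so the encoder accepts it
        self.encoder = OrdinalEncoder()
        self.encoder.fit(X.fillna('Missing'))
        return self  # fit should return the transformer itself, not the inner encoder

    def transform(self, X, y=None):
        return self.encoder.transform(X.fillna('Missing'))

    def fit_transform(self, X, y=None, **fit_params):
        self.encoder = OrdinalEncoder()
        return self.encoder.fit_transform(X.fillna('Missing'))

# cat_features and num_features are lists of column names defined elsewhere
preprocessor = ColumnTransformer([
    ('categoricals', CustomEncoder(), cat_features),
    ('numericals', StandardScaler(), num_features)],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('imputing', KNNImputer(n_neighbors=5))
])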
In this scenario, however, I cannot find a reasonable way to set the encoded 'Missing' values back to np.nan before imputing with the KNNImputer.
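To make the goal concrete, here is a minimal sketch of one direction that might work: remember where the NaNs were before filling them, encode, and then write np.nan back into those positions. The class name NaNRestoringEncoder and the toy DataFrame are hypothetical and only for illustration. The sketch assumes the input is a pandas DataFrame and a scikit-learn version whose OrdinalEncoder accepts handle_unknown='use_encoded_value' with unknown_value=np.nan, so that categories unseen during fit are also encoded as NaN.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler


class NaNRestoringEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical encoder: ordinal-encode, then restore NaN where X was missing."""

    def fit(self, X, y=None):
        # Assumption: unseen categories at transform time should become NaN
        # so that they are imputed as well.
        self.encoder_ = OrdinalEncoder(
            handle_unknown='use_encoded_value', unknown_value=np.nan
        )
        self.encoder_.fit(X.fillna('Missing'))
        return self

    def transform(self, X):
        # Remember the missing positions before filling them with the placeholder
        mask = X.isna().to_numpy()
        encoded = self.encoder_.transform(X.fillna('Missing'))
        encoded[mask] = np.nan  # put NaN back for the imputer
        return encoded


# Toy data, made up for illustration only
df = pd.DataFrame({
    'color': ['red', np.nan, 'blue', 'red', 'blue'],
    'size': [1.0, 2.0, np.nan, 4.0, 5.0],
    'weight': [10.0, 12.0, 11.0, 30.0, 31.0],
})
cat_features = ['color']
num_features = ['size', 'weight']

preprocessor = ColumnTransformer([
    ('categoricals', NaNRestoringEncoder(), cat_features),
    ('numericals', StandardScaler(), num_features),
])
encoded = preprocessor.fit_transform(df)  # NaNs are still present here

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('imputing', KNNImputer(n_neighbors=2)),
])
imputed = pipeline.fit_transform(df)  # NaNs filled from nearest neighbours
```

Two caveats with this sketch: KNNImputer averages the neighbours' values, so the imputed category codes can be fractional and would need rounding before being mapped back to categories, and ordinal codes impose an ordering on the categories that may not be meaningful for the distance computation.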
I've read in this thread (Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn) that I could do this manually using the OneHotEncoder transformer, but again, I'd like to implement all of this in a pipeline to automate the entire pre-processing phase.
Has anyone managed to do this? Would anyone recommend an alternative solution? Or is imputing with a KNN algorithm not worth the trouble, and should I use a simple imputer instead?
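For comparison, the simple-imputer fallback I have in mind would look roughly like the sketch below: fill missing categories with the most frequent value first, then encode. The toy DataFrame and the name simple_cat_pipeline are made up for illustration, and strategy='most_frequent' is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data, made up for illustration only
df_cat = pd.DataFrame({'color': ['red', np.nan, 'blue', 'red']})

# Impute first, so the encoder never sees NaN
simple_cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OrdinalEncoder()),
])
encoded_simple = simple_cat_pipeline.fit_transform(df_cat)
```

This avoids the encode-then-restore problem entirely, at the cost of ignoring the other features when filling in a missing category.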
Thanks in advance for your feedback!