Imputing missing values in nested GridSearchCV pipeline to avoid data leakage

Question

I am having trouble imputing values with scikit-learn inside the cross-validation and Pipeline frameworks. The goal is to avoid global imputation, which would distort the estimate of the model's performance through data leakage. After looking through several links and guides, I mixed and matched what I found into the following code snippet. I want to use this with several linear models, but the example sticks to Lasso.

My dataset consists of 95 numerical parameters and 5 categorical ones. NaNs are present throughout (8-27% per column) across the 100 observations in total. There are no NaNs in my response, y.

Here I try to impute the numerical variables using KNN and then scale them; for the categorical variables I use most-frequent imputation followed by one-hot encoding.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LassoLars
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Data prep
"""
We have 100 parameters; 95 are numerical and 5 are categorical.
Each row contains numerous missing values, which we need to impute *inside* the CV-loop
"""
df = pd.read_csv(dataPath, delimiter=',', skipinitialspace = True)
y = df.loc[:, 'y']
df = df.drop(['y'], axis=1)
df_num = df[df.columns[:95]]
df_cat = df[df.columns[95:]]

num_na = df_num.columns.to_list()
cat_na = df_cat.columns.to_list()


# Apply KNNImputing, scale afterwards
numeric_transformer = Pipeline(steps=[
   ('imputer', KNNImputer(n_neighbors=2, weights="uniform")),
   ('scaler', StandardScaler())])
# Most-frequent imputation, then encoding
# (note: OrdinalEncoder yields integer codes, not one-hot/dummy columns)
categorical_transformer = Pipeline(steps=[
   ('imputer', SimpleImputer(strategy='most_frequent')),
   ('onehot', OrdinalEncoder())])
# Do it for the established columns in df
preprocessor = ColumnTransformer(
   remainder = 'passthrough',
   transformers=[
       ('numeric', numeric_transformer, num_na),
       ('categorical', categorical_transformer, cat_na)
])

# Nested CV: pass the configured GridSearchCV to cross_val_score. In each outer fold the
# grid search runs on the training split, is refit with the best parameters, and is scored on the test split
cv_inner = KFold(n_splits=3, shuffle=True, random_state=42) # Parameter validation
cv_outer = KFold(n_splits=5, shuffle=True, random_state=42) # Model validation
# define the model
pipe = Pipeline(steps=[('preprocessing', preprocessor),
                       ('clf', Lasso())]
)
# Gridsearch over parameter grid
alpha_grid = np.logspace(-4, 3, 100)
param_grid = [{'clf__alpha': alpha_grid}]
grid = GridSearchCV(pipe, cv=cv_inner, param_grid=param_grid, verbose=1, 
                return_train_score=True, scoring='neg_root_mean_squared_error',
                refit=True, n_jobs=-1)



scores = cross_val_score(grid, df, y, scoring='neg_root_mean_squared_error',
                        cv=cv_outer, n_jobs=-1)

print('Avg. RMSE across outer CV: %.3f ' % (np.mean(-scores)))

When I run this, I get

        RuntimeWarning: invalid value encountered in divide
  * (last_sum / last_over_new_count - new_sum) ** 2

alluding to some division involving the NaNs present. Are there any brilliant scikit-learn minds that can sanity check me in this regard?

Thanks in advance.

Help narrow this down: Does it work without warning to fit the pipeline without grid search + cross-val-score? What if you remove individual transformers? — Ben Reiniger, Mar 16 '23 at 15:10
I think this might happen in the scaler if one of your features is all-missing in one of the sub-folds? — Ben Reiniger, Mar 16 '23 at 15:12
Hm, I think this should be handled by https://github.com/scikit-learn/scikit-learn/blob/ab3c0605de29d9f0662d27c1873756b8bac7047f/sklearn/utils/extmath.py#L1076 ; it's been there a while, what version of sklearn are you using? — Ben Reiniger, Mar 16 '23 at 15:18
It should not be missing, as my pipeline (when I checked it before parsing) does indeed replace the NaN values. Hence, inside each fold to my understanding, it should impute --> encode/scale --> fit. Good suggestion, I will try to remove individual parts, however then it will no longer be nested? My sklearn.__version__ is 1.2.0 — Grok, Mar 16 '23 at 15:23
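
A minimal sketch of the check suggested in the comments above: whether some feature is entirely missing in the training part of a fold. The helper name `all_missing_columns_per_fold` and the small synthetic frame `X_demo` are hypothetical stand-ins, not part of the original code; the real `df` and `cv_outer` or `cv_inner` would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def all_missing_columns_per_fold(X, cv):
    """Map fold index -> columns that are entirely NaN in that fold's training split."""
    report = {}
    for i, (train_idx, _) in enumerate(cv.split(X)):
        train = X.iloc[train_idx]
        empty = train.columns[train.isna().all()].tolist()
        if empty:
            report[i] = empty
    return report


# Hypothetical demo data: column "c" is observed only in the last 4 of 12 rows
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(12, 3)), columns=["a", "b", "c"])
X_demo.loc[:7, "c"] = np.nan

# Unshuffled 3-fold split: the third fold trains on rows 0-7, where "c" is all NaN
cv_demo = KFold(n_splits=3, shuffle=False)
fold_report = all_missing_columns_per_fold(X_demo, cv_demo)
print(fold_report)
```

For the nested setup, the same helper could be applied once more to each outer training split (`X.iloc[train_idx]`) with the inner splitter, since the inner folds see even fewer rows.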
