I would like to use GridSearchCV to determine the optimal regularization parameter "C" in a logistic regression with L1 regularization. I would also like to scale/standardize my input features.
Scaling the entire training dataset with a single transform before running the cross-validation causes data leakage: during cross-validation, the training dataset is split into k folds, each of which serves once as the validation fold while the remaining folds act as training folds. If the standardization is done on the entire training dataset beforehand, every fold (including the validation fold) has been scaled with parameters (e.g., mean and standard deviation) computed from the entire training dataset, so the training folds always "know something" about the validation fold.
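To make the leak concrete, here is a minimal sketch (on toy random data, using StandardScaler and one KFold split) showing that the scaling parameters fitted on the full training set differ from those fitted on the training folds alone, because the former have already "seen" the rows of the validation fold:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))

# Parameters fitted on ALL rows, including rows that will later serve as a validation fold
full_scaler = StandardScaler().fit(X)

# Parameters fitted on the training folds of one CV split only
train_idx, val_idx = next(KFold(n_splits=5).split(X))
fold_scaler = StandardScaler().fit(X[train_idx])

# The differences are non-zero: the full-set parameters carry information about the validation fold
print(full_scaler.mean_ - fold_scaler.mean_)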
The proper way to scale the data is therefore to compute and apply the scaling separately in each cross-validation iteration, i.e., to fit the scaler on the internal training folds only and use it to transform the held-out validation fold. In scikit-learn, this can be done with pipelines.
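Conceptually, a pipeline inside GridSearchCV repeats something like the following loop for every fold (and every candidate C). This is only a hand-rolled sketch of what scikit-learn does internally, using the X_train/y_train arrays from the test case below:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

for train_idx, val_idx in KFold(n_splits=10).split(X_train):
    # Fit the scaler on the training folds only...
    fold_scaler = StandardScaler().fit(X_train[train_idx])
    X_tr = fold_scaler.transform(X_train[train_idx])
    # ...and reuse those parameters to transform the held-out validation fold
    X_val = fold_scaler.transform(X_train[val_idx])
    model = LogisticRegression(penalty='l1', solver='liblinear',
                               max_iter=1000).fit(X_tr, y_train[train_idx])
    fold_auc = roc_auc_score(y_train[val_idx], model.decision_function(X_val))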
I implemented a test case to compare the two approaches ("improper scaling" vs. "proper scaling with a pipeline"). With StandardScaler, the resulting regression coefficients were identical for both approaches, which surprised me. With RobustScaler, however, the coefficients differ.
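For reference, the two scalers estimate different per-feature statistics, which you can inspect after fitting; a small illustration on a toy column with one outlier:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X_toy = np.append(np.arange(1., 10.), 100.).reshape(-1, 1)  # 1..9 plus an outlier

ss = StandardScaler().fit(X_toy)
print(ss.mean_, ss.scale_)    # mean and standard deviation, both pulled by the outlier

rs = RobustScaler().fit(X_toy)
print(rs.center_, rs.scale_)  # median and IQR, barely moved by the outlier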
Why does "pipelining" the scaling make a difference for RobustScaler, but not for StandardScaler?
Thanks!
Here is my test case:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
# Choose between the two scalers:
# scaler = RobustScaler()
scaler = StandardScaler()
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
###########################################
# Version A: Proper scaling with pipeline #
###########################################
param_grid = {'logisticregression__C': C_values}
logReg = LogisticRegression(fit_intercept=True,
                            penalty='l1',
                            solver='liblinear',
                            tol=0.0001,
                            max_iter=1000,
                            random_state=0)
# Create a pipeline that scales, then runs logistic regression
pipeline = make_pipeline(scaler, logReg)
vA = GridSearchCV(pipeline, param_grid=param_grid,
                  scoring='roc_auc', cv=10, refit=True)
vA.fit(X_train, y_train)
# Get coefficients
coefA = vA.best_estimator_.named_steps['logisticregression'].coef_
###############################
# Version B: Improper scaling #
###############################
param_grid = {'C': C_values}
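# Fit the scaler once on the whole training set: this is the leaky step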
X_train_scaled = scaler.fit_transform(X_train)
vB = GridSearchCV(logReg, param_grid=param_grid,
                  scoring='roc_auc', cv=10, refit=True)
vB.fit(X_train_scaled, y_train)
# Get coefficients
coefB = vB.best_estimator_.coef_
# Compare coefficients
# (Assertion will pass for StandardScaler, but
# fail for RobustScaler)
assert np.array_equal(coefA, coefB)
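When the assertion fails, a first thing to check is whether the two searches even settled on the same C, e.g.:

print(vA.best_params_, vB.best_params_)  # C selected by each search
print(vA.best_score_, vB.best_score_)    # corresponding mean cross-validated AUCs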