
I would like to use GridSearchCV to determine the optimal regularization parameter "C" in a logistic regression with L1 regularization. I would also like to scale/standardize my input features.

Scaling the entire training dataset with a single transform before performing the cross-validation results in data leakage: In the cross-validation, the training dataset is divided into k folds, each of which is treated once as the validation dataset, while the others are the training folds. However, if the standardization is done prior to the cross-validation on the entire training dataset, each fold (including the validation fold) will have been scaled using parameters (e.g., mean and standard deviation) that were calculated from the entire training dataset, so in a way the training folds always "know something" about the validation fold.
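To make the leakage concrete, here is a minimal numeric sketch (with a made-up 1-D feature, not data from the question) showing that statistics computed on the full training set are contaminated by the held-out validation fold:

```python
import numpy as np

# Hypothetical 1-D feature: six training samples, the last two of which
# act as the validation fold in one CV iteration.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 101.0])
train_folds, val_fold = x[:4], x[4:]

# Improper: statistics computed on ALL samples, including the validation fold
mean_all, std_all = x.mean(), x.std()

# Proper: statistics computed on the training folds only
mean_tr, std_tr = train_folds.mean(), train_folds.std()

# The outlier-ish validation samples pull the "improper" mean far away
# from the mean of the training folds alone.
print(mean_all)  # ~35.17
print(mean_tr)   # 2.5
```

The training folds scaled with `mean_all`/`std_all` implicitly carry information about the validation samples, which is exactly the leakage described above.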

Thus, the proper way to scale the data would be to compute and apply the scaling for each cross-validation fold separately (i.e., on the internal training folds, holding out the validation fold in each iteration). In scikit-learn, this can be done using pipelines.

I implemented a test case to look at the difference between the two methods ("improper scaling" vs. "proper scaling with pipelines"), and when using StandardScaler, the resulting regression coefficients were the same regardless of the method, which I found surprising. However, when using RobustScaler, the resulting coefficients are different.

Why does "pipelining" the scaling make a difference for RobustScaler, but not for StandardScaler?

Thanks!

Here is my test case:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer

# Choose between the two scalers:
# scaler = RobustScaler()
scaler = StandardScaler()  

C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

###########################################
# Version A: Proper scaling with pipeline #
###########################################

param_grid = {'logisticregression__C': C_values}

logReg = LogisticRegression(fit_intercept=True, 
                            penalty='l1', 
                            solver='liblinear', 
                            tol=0.0001, 
                            max_iter=1000, 
                            random_state=0)

# Create a pipeline that scales, then runs logistic regression
pipeline = make_pipeline(scaler, logReg)

vA = GridSearchCV(pipeline, param_grid=param_grid,
                     scoring='roc_auc', cv=10, refit=True)
vA.fit(X_train, y_train)

# Get coefficients
coefA = vA.best_estimator_.named_steps['logisticregression'].coef_

###############################
# Version B: Improper scaling #     
###############################

param_grid = {'C': C_values}

X_train_scaled = scaler.fit_transform(X_train)

vB = GridSearchCV(logReg, param_grid=param_grid,
                     scoring='roc_auc', cv=10, refit=True)
vB.fit(X_train_scaled, y_train)

# Get coefficients
coefB = vB.best_estimator_.coef_


# Compare coefficients
# (Assertion will pass for StandardScaler, but 
# fail for RobustScaler)
assert np.array_equal(coefA, coefB)
mella

1 Answer


First things first: it is just a coincidence here that StandardScaler does not change the value of coef_, and it is due to the random_state and cv you have chosen. If you change cv=10 to something like cv=3 or cv=4 and remove the random_state, you will get different coef_ values for StandardScaler as well.

Now for the explanation:

The line to observe in the first method is:

vA.fit(X_train, y_train)

vA is a grid search: it performs cross-validation by splitting X_train, y_train into further train and validation folds, finds the best parameters, and then refits on the whole X_train, y_train. That means the final pipeline is fit on the whole data, so at that stage it does not matter whether you use StandardScaler or RobustScaler.

In method 2 you are doing:

X_train_scaled = scaler.fit_transform(X_train)

So you are fitting the scaler on the same data in both methods: the (refit) scalers are fit on the exact same data and learn the exact same scale_, mean_, and other attributes.
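This can be checked directly. The following sketch (not from the original post; `refit_scaler` and `standalone` are names I made up) runs a reduced grid search and compares the attributes of the scaler that the pipeline refit with a scaler fit directly on X_train:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(penalty='l1', solver='liblinear',
                                        max_iter=1000, random_state=0))

# Reduced grid (2 C values, cv=3) just to keep this quick
gs = GridSearchCV(pipe, {'logisticregression__C': [0.1, 1.0]},
                  cv=3, refit=True)
gs.fit(X_train, y_train)

# Scaler refit by the pipeline on ALL of X_train after the search ...
refit_scaler = gs.best_estimator_.named_steps['standardscaler']
# ... versus a scaler fit on X_train directly, as in method 2
standalone = StandardScaler().fit(X_train)

print(np.allclose(refit_scaler.mean_, standalone.mean_))    # True
print(np.allclose(refit_scaler.scale_, standalone.scale_))  # True
```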

With that out of the way, let's check whether the exact same LogisticRegression is being fit.

Do this in method 1:

>> print(vA.best_params_)
#Output: {'logisticregression__C': 1.0}

And this in method 2:

>> print(vB.best_params_)
#Output: {'C': 1}   for StandardScaler
#Output: {'C': 0.1}   for RobustScaler

So you see, the difference in coef_ is due to the difference in the C found for LogisticRegression. The C that the grid search found best is the same in both methods for StandardScaler (equal to 1.0), but not for RobustScaler.

The internal splits made by GridSearchCV are passed to RobustScaler, which scales each fold's data differently, and hence a different C is found to be best.
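That per-fold difference can be observed directly. This sketch (not from the original answer) fits a RobustScaler on each internal training split, as the pipeline would during cross-validation, and compares its learned center_ to a fit on the full data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.preprocessing import RobustScaler

X, _ = load_breast_cancer(return_X_y=True)

# Statistics (per-feature medians) learned on the full data
full = RobustScaler().fit(X)

# Statistics learned on each internal training split
diffs = []
for train_idx, _ in KFold(n_splits=10).split(X):
    fold = RobustScaler().fit(X[train_idx])
    diffs.append(np.abs(fold.center_ - full.center_).max())

max_diff = max(diffs)
print(max_diff)  # > 0: each split is centered slightly differently
```

Because each split's median and IQR differ from the full-data values, the pipelined grid search effectively evaluates each C on differently scaled data, which is why it can land on a different best C.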

Vivek Kumar
  • I did not expect the two scalers to yield the exact same regression coefficients. However what I did expect was that the way the scaling is done (integrated in the pipeline vs. before the GridSearch) would make a difference - it actually did for RobustScaler, but not for StandardScaler. Are you saying even when using a pipeline that scaling is done on the entire training set? That would contradict what was explained e.g. here: https://stackoverflow.com/a/24022277/9482114 – mella Mar 14 '18 at 21:42
  • @mella No, I am not saying that it will do the same scaling inside the grid search. I am saying that whatever is passed into the pipeline (at refit time) will be scaled as a whole. These are two different concepts. Please read the question carefully and if not, ask and I'll explain. – Vivek Kumar Mar 15 '18 at 04:18
  • I don't think that whatever gets passed into the pipeline will be scaled as a whole - at least in the question I linked to above, it was confirmed that when doing CV, for a given fold the algorithm standardizes only the train set in that fold (i.e., it does not include the fold's test set for determining parameters of the scaler such as mean or variance). – mella Mar 15 '18 at 13:04