cross_val_score gives different results than LogisticRegressionCV, and I can't figure out why.

Here is my code:

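import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import scale

# X and Y (features and labels) are assumed to be defined earlier.
seed = 42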
test_size = .33
X_train, X_test, Y_train, Y_test = train_test_split(scale(X),Y, test_size=test_size, random_state=seed)

#Below is my model that I use throughout the program.

model = LogisticRegressionCV(random_state=42)
print('Logistic Regression results:')
        
#For cross_val_score below, I just call LogisticRegression (and not LogRegCV) with the same parameters.

scores = cross_val_score(LogisticRegression(random_state=42), X_train, Y_train, scoring='accuracy', cv=5)
print(np.amax(scores)*100)
print("%.2f%% average accuracy with a standard deviation of %0.2f" % (scores.mean() * 100, scores.std() * 100))
        
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(Y_test, predictions)

coef=np.round(model.coef_,2)

print("Accuracy: %.2f%%" % (accuracy * 100.0))

The output is this.

Logistic Regression results:
79.90483019359885
79.69% average accuracy with a standard deviation of 0.14
Accuracy: 79.81%

Why is the maximum accuracy from cross_val_score higher than the accuracy I get from LogisticRegressionCV?

I recognize that cross_val_score does not return a model, which is why I want to use LogisticRegressionCV, but I am struggling to understand why it does not perform as well. I am also not sure how to get the standard deviations of the predictors from LogisticRegressionCV.

1 Answer

There are a few points to take into consideration:

  1. Cross-validation is generally used when you need to simulate a validation set (for instance, when the training set is too small to be split into training, validation and test sets), and it uses only the training data. In your case, the cross_val_score numbers come from folds of the training data, while the LogisticRegressionCV accuracy is computed on the test data, so the two results cannot be compared exactly.
  2. According to the docs:

Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements.

The following snippet shows that this is indeed what happens:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split

data = load_breast_cancer()
X, y = data['data'], data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

estimator = LogisticRegression(random_state=42, solver='liblinear')
grid = {
    'C': np.power(10.0, np.arange(-10, 10)), 
}

gs = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_)                        # 0.953846153846154

lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
                        cv=5, scoring='accuracy', solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
print(lrcv.scores_[1].mean(axis=0).max())    # 0.953846153846154

I would also suggest having a look here for the details of lrcv.scores_[1].mean(axis=0).max().
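As a rough illustration of what that expression unpacks, here is a minimal sketch. It assumes that `scores_` is a dict keyed by class label whose values have shape (n_folds, n_Cs), as the snippet above relies on. The `random_state=0` in the split and the names `fold_scores`, `mean_per_C`, `std_per_C` and `best` are my own additions for illustration. The same array also gives the spread of the fold accuracies, comparable to `scores.std()` from cross_val_score.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# random_state=0 is an assumption, added only so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

Cs = np.power(10.0, np.arange(-10, 10))   # same 20 candidate values as above
lrcv = LogisticRegressionCV(Cs=list(Cs), cv=5, scoring='accuracy',
                            solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)

# Assumed layout: one row per fold, one column per candidate C
fold_scores = lrcv.scores_[1]
mean_per_C = fold_scores.mean(axis=0)     # average accuracy over folds, per C
std_per_C = fold_scores.std(axis=0)       # spread over folds, per C
best = mean_per_C.argmax()                # column with the best mean accuracy

print("C with best mean CV accuracy:", Cs[best])
print("mean CV accuracy: %.4f, std over folds: %.4f"
      % (mean_per_C[best], std_per_C[best]))
```

So `.mean(axis=0)` averages over the folds for each C, and `.max()` picks the best of those averages, which is the quantity GridSearchCV reports as `best_score_`.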

  3. Finally, to get the same result with cross_val_score, you should write:

     score = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
     score.mean()                             # 0.953846153846154
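To tie this back to the numbers in the question, the sketch below puts the two quantities side by side: fold accuracies from cross_val_score on the training data (with the default C=1.0), and the held-out test accuracy of a fitted LogisticRegressionCV (with the C it selected). The breast cancer dataset, the liblinear solver and the variable names are assumptions made so that the example is self-contained. It is not the asker's data, so the values will differ from theirs.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# (a) 5-fold CV of a plain LogisticRegression (fixed C=1.0), training data only
cv_scores = cross_val_score(
    LogisticRegression(solver='liblinear', random_state=42),
    X_train, y_train, scoring='accuracy', cv=5,
)

# (b) LogisticRegressionCV selects C by internal CV on the training data,
#     then is evaluated once on the held-out test data
lrcv = LogisticRegressionCV(cv=5, scoring='accuracy',
                            solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
test_acc = lrcv.score(X_test, y_test)
chosen_C = float(np.ravel(lrcv.C_)[0])    # works whether C_ is an array or a scalar

print("CV folds (train only): max %.4f, mean %.4f, std %.4f"
      % (cv_scores.max(), cv_scores.mean(), cv_scores.std()))
print("Test accuracy of LogisticRegressionCV: %.4f (chosen C = %g)"
      % (test_acc, chosen_C))
```

The maximum over folds is the best single fold and is never below the fold mean, so on its own it is an optimistic number to compare against a single test-set accuracy.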
    
amiola