How to correctly perform cross validation in scikit-learn?

Question

I am trying to do a cross validation on a k-nn classifier and I am confused about which of the following two methods below conducts cross validation correctly.

training_scores = defaultdict(list)
validation_f1_scores = defaultdict(list)
validation_precision_scores = defaultdict(list)
validation_recall_scores = defaultdict(list)
validation_scores = defaultdict(list)

def model_1(seed, X, Y):
    np.random.seed(seed)
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=13)

    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    scores = model_selection.cross_validate(model, X, Y, cv=kfold, scoring=scoring, return_train_score=True)
    print(scores['train_accuracy'])
    training_scores['KNeighbour'].append(scores['train_accuracy'])
    print(scores['test_f1_macro'])
    validation_f1_scores['KNeighbour'].append(scores['test_f1_macro'])
    print(scores['test_precision_macro'])
    validation_precision_scores['KNeighbour'].append(scores['test_precision_macro'])
    print(scores['test_recall_macro'])
    validation_recall_scores['KNeighbour'].append(scores['test_recall_macro'])
    print(scores['test_accuracy'])
    validation_scores['KNeighbour'].append(scores['test_accuracy'])

    print(np.mean(training_scores['KNeighbour']))
    print(np.std(training_scores['KNeighbour']))
    #rest of print statments

It seems that for loop in the second model is redundant.

def model_2(seed, X, Y):
    np.random.seed(seed)
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=13)

    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    for train, test in kfold.split(X, Y):
        scores = model_selection.cross_validate(model, X[train], Y[train], cv=kfold, scoring=scoring, return_train_score=True)
        print(scores['train_accuracy'])
        training_scores['KNeighbour'].append(scores['train_accuracy'])
        print(scores['test_f1_macro'])
        validation_f1_scores['KNeighbour'].append(scores['test_f1_macro'])
        print(scores['test_precision_macro'])
        validation_precision_scores['KNeighbour'].append(scores['test_precision_macro'])
        print(scores['test_recall_macro'])
        validation_recall_scores['KNeighbour'].append(scores['test_recall_macro'])
        print(scores['test_accuracy'])
        validation_scores['KNeighbour'].append(scores['test_accuracy'])

    print(np.mean(training_scores['KNeighbour']))
    print(np.std(training_scores['KNeighbour']))
    # rest of print statments

I am using StratifiedKFold and I am not sure if I need for loop as in model_2 function or does cross_validate function already use the split as we are passing cv=kfold as an argument.

I am not calling fit method, is this OK? Does cross_validate calls that automatically or do I need to call fit before calling cross_validate?

Finally, how can I create confusion matrix? Do I need to create it for each fold, if yes, how can the final/average confusion matrix be calculated?

desertnaut · Accepted Answer · 2019-03-20T22:20:53.713

The documentation is arguably your best friend in such questions; from the simple example there it should be apparent that you should use neither a for loop nor a call to fit. Adapting the example to use KFold as you do:

from sklearn.model_selection import KFold, cross_validate
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)

model = DecisionTreeRegressor()
scoring=('r2', 'neg_mean_squared_error')

cv_results = cross_validate(model, X, y, cv=kf, scoring=scoring, return_train_score=False)
cv_results

Result:

{'fit_time': array([0.00901461, 0.00563478, 0.00539804, 0.00529385, 0.00638533]),
 'score_time': array([0.00132656, 0.00214362, 0.00134897, 0.00134444, 0.00176597]),
 'test_neg_mean_squared_error': array([-11.15872549, -30.1549505 , -25.51841584, -16.39346535,
        -15.63425743]),
 'test_r2': array([0.7765484 , 0.68106786, 0.73327311, 0.83008371, 0.79572363])}

how can I create confusion matrix? Do I need to create it for each fold

No one can tell you if you need to create a confusion matrix for each fold - it is your choice. If you choose to do so, it may be better to skip cross_validate and do the procedure "manually" - see my answer in How to display confusion matrix and report (recall, precision, fmeasure) for each cross validation fold.

if yes, how can the final/average confusion matrix be calculated?

There is no "final/average" confusion matrix; if you want to calculate anything further than the k ones (one for each k-fold) as described in the linked answer, you need to have available a separate validation set...

If I want to display confusion matrix for each fold, how would you do that as in my case I am using cross_validate so I don't have access to `confusion_matrix(y[val_index], pred))` as shown in your link. Because cross_validate is creating fold. — learner, Mar 20 '19 at 22:46
@learner you mean you are *bounded* in using `cross_validate`? Because the linked answer is the most straightforward way to do so, and you don't "lose" anything (scores etc) compared to your current approach - it is actually a (correct) modification of your `model_2` approach; in any case, you could check [`cross_val_predict`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) — desertnaut, Mar 20 '19 at 23:04

mujjiga · Answer 2 · 2019-03-21T05:23:37.513

model_1 is correct.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=’warn’, n_jobs=None, verbose=0, fit_params=None, pre_dispatch=‘2*n_jobs’, return_train_score=’warn’, return_estimator=False, error_score=’raise-deprecating’)

where

estimator is an object implementing ‘fit’. It will be called to fit the model on the train folds.

cv: is a cross-validation generator that is used to generated train and test splits.

If you follow the example in the sklearn docs

cv_results = cross_validate(lasso, X, y, cv=3, return_train_score=False) cv_results['test_score'] array([0.33150734, 0.08022311, 0.03531764])

You can see that the model lasso is fitted 3 times once for each fold on train splits and also validated 3 times on test splits. You can see that the test score on validation data are reported.

Cross validation of Keras models

Keras provides wrapper which makes the keras models compatible with sklearn cross_validatation method. You have to wrap the keras model using KerasClassifier

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold, cross_validate
from keras.models import Sequential
from keras.layers import Dense
import numpy as np

def get_model():
    model = Sequential()
    model.add(Dense(2, input_dim=2, activation='relu')) 
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=get_model, epochs=10, batch_size=8, verbose=0)
kf = KFold(n_splits=3, shuffle=True)

X = np.random.rand(10,2)
y = np.random.rand(10,1)

cv_results = cross_validate(model, X, y, cv=kf, return_train_score=False)

print (cv_results)

How to correctly perform cross validation in scikit-learn?

2 Answers2

Cross validation of Keras models

Linked