I'm trying to understand k-fold cross-validation using the sklearn Python module.

I understand the basic flow:

  • instantiate a model e.g. model = LogisticRegression()
  • fitting the model e.g. model.fit(xtrain, ytrain)
  • predicting, e.g. model.predict(xtest)
  • using e.g. cross_val_score to test the fitted model's accuracy.
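In code, I understand that flow as something like this (the toy dataset here is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Illustrative data and a simple train/test split
X, y = make_classification(n_samples=200, random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

model = LogisticRegression()   # instantiate a model
model.fit(xtrain, ytrain)      # fit the model on training data
preds = model.predict(xtest)   # predict takes features, not labels

# cross_val_score evaluates the model across folds of the training data
accuracies = cross_val_score(LogisticRegression(), xtrain, ytrain,
                             scoring='accuracy', cv=5)
```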

Where I'm confused is using sklearn's KFold with cross_val_score. As I understand it, the cross_val_score function fits the model and predicts on the k folds, giving you an accuracy score for each fold.

e.g. using code like this:

from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

# KFold in current scikit-learn takes n_splits instead of n/n_folds
kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = linear_model.LogisticRegression()
accuracies = cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=kf)

So if I have a dataset split into training and testing data, and I use the cross_val_score function with KFold to determine the accuracy of the algorithm on my training data for each fold, is the model now fitted and ready for prediction on the testing data? In the case above, could I just call lr.predict?

desertnaut
hselbie
  • I don't believe so, but you should look into `GridSearchCV`. I almost always use this instead of `cross_val_score` because it's basically like a model that you can fit and predict on, and is useful for tuning the parameters of your model. If you don't want to tune any parameters, you can pass `{}`. – justincai Feb 16 '17 at 03:07
  • THIS question, that is more recent, should be closed, not the one that was actually closed that was asked first. I really hate it when people arbitrarily close questions without any good judgment. – user Jan 08 '22 at 18:18
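Following up on the GridSearchCV suggestion in the comments: with an empty parameter grid it does no tuning, but it still cross-validates the single candidate model and then refits it on all of the training data, so you get a fit/predict object back (a sketch; the dataset and variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Empty param_grid: no tuning, but GridSearchCV still cross-validates
# and then refits the (only) model on all of X_train.
gs = GridSearchCV(LogisticRegression(), param_grid={}, cv=5)
gs.fit(X_train, y_train)

preds = gs.predict(X_test)  # uses the refitted model
```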

1 Answer


No, the model is not fitted. Looking at the source code of cross_val_score:

scores = parallel(delayed(_fit_and_score)(clone(estimator), X, y, scorer,
                                          train, test, verbose, None,
                                          fit_params)
                  for train, test in cv_iter)

As you can see, cross_val_score clones the estimator before fitting each fold's training data to it, so the estimator you pass in is left untouched. cross_val_score returns an array of scores, which you can analyse to see how the estimator performs on different folds of the data and whether it overfits. You can read more in the scikit-learn cross-validation documentation.

Once you are satisfied with the results of cross_val_score, you need to fit the estimator on the whole training data before you can use it to predict on test data.
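A sketch of that full workflow (the dataset and variable names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = LogisticRegression()

# cross_val_score fits and scores clones of lr; lr itself stays unfitted
accuracies = cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=kf)

# Satisfied with the scores? Now fit lr itself on all the training data
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
```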

Vivek Kumar
  • Is there any way of getting the training and testing time when using cross_val_score? As far as I can see in the source code, _fit_and_score returns a fit_time and a score_time, but I am not sure if there is any way of retrieving those when using cross_val_score. – No Reply Mar 25 '17 at 14:50
  • After cross_val_score, if I get scores for 10 folds, how do I apply a final averaged model to make predictions on test data? I don't understand how to get the final model. – Evgeny Nov 19 '20 at 08:40
  • Cross-validation is just to check the model's performance on the data distribution. Once you are satisfied with that, you will need to train a new model on the full data. – Vivek Kumar Nov 19 '20 at 08:50
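On the timing question in the comments: more recent scikit-learn versions provide cross_validate, which returns the per-fold fit and score times alongside the scores (a sketch; the dataset here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=0)

results = cross_validate(LogisticRegression(), X, y, cv=5,
                         scoring='accuracy')
print(results['fit_time'])    # seconds spent fitting each fold
print(results['score_time'])  # seconds spent scoring each fold
print(results['test_score'])  # accuracy on each fold
```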