
I would like to use k-fold cross-validation while training a model. So far I am doing it like this:

# imports for the snippet below
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25, random_state=4222)

# learning a model
model = MultinomialNB()
model.fit(X_train, y_train)
scores = cross_val_score(model, X_train, y_train, cv=5)

At this step I am not quite sure whether I should call model.fit() or not, because in the official scikit-learn documentation they do not fit the model but just call cross_val_score as follows (they do not even split the data into training and test sets):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

I would also like to tune the hyperparameters of the model while training it. What is the right pipeline?

torayeff
  • You do not need to do a split into train+test here, as that split is done for model performance evaluation and CV does exactly the same thing (performance evaluation), just in a more robust way. This comment does not apply if you have a more complex scenario in mind and want to optimise hyperparameters or do other advanced procedures. – Mischa Lisovyi May 14 '18 at 12:06

2 Answers


If you want to do hyperparameter selection, look into RandomizedSearchCV or GridSearchCV. If you want to use the best model afterwards, call either of these with refit=True (the default) and then use best_estimator_.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

log_params = {'penalty': ['l1', 'l2'], 'C': [1E-7, 1E-6, 1E-5, 1E-4, 1E-3]}
clf = LogisticRegression(solver='liblinear')  # liblinear supports both 'l1' and 'l2'
search = RandomizedSearchCV(clf, scoring='average_precision', cv=10,
                            n_iter=10, param_distributions=log_params,
                            refit=True, n_jobs=-1)
search.fit(X_train, y_train)
clf = search.best_estimator_

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
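
To check how the tuned model generalises, you can then score the refitted estimator on the held-out test set (a minimal sketch, assuming the X_test/y_test split from the question):

# inspect the selected hyperparameters, then evaluate the refitted
# model once on the held-out test set from the question's split
print(search.best_params_)
print(clf.score(X_test, y_test))  # mean accuracy on unseen data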

Bert Kellerman

Your second example is right for doing the cross-validation. See the example here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics

The fitting is done inside the cross_val_score function; you don't need to fit the model yourself beforehand.

[Edited] If, besides the cross-validation, you also want a trained model, you can call model.fit() afterwards, as in the sketch below.
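
Put together, a minimal sketch of that pipeline (reusing dataset_1 and df1['label'] from the question) might look like:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# hold out a test set for the final check
X_train, X_test, y_train, y_test = train_test_split(
    dataset_1, df1['label'], test_size=0.25, random_state=4222)

model = MultinomialNB()

# cross-validated performance estimate; cross_val_score clones and
# fits the model on each of the 5 folds internally
scores = cross_val_score(model, X_train, y_train, cv=5)

# train the final model on the full training set, then evaluate it
# once on the held-out test set
model.fit(X_train, y_train)
print(model.score(X_test, y_test))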

markus-hinsche