
I would like to use k-fold cross-validation while training a model. So far I am doing it like this:

# imports for the snippet below
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25, random_state=4222)

# learning a model
model = MultinomialNB()
model.fit(X_train, y_train)
scores = cross_val_score(model, X_train, y_train, cv=5)

At this step I am not quite sure whether I should call model.fit() or not, because in the official scikit-learn documentation they do not fit the model but just call cross_val_score as follows (they do not even split the data into training and test sets):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

I would also like to tune the hyperparameters of the model while training it. What is the right pipeline?

torayeff
  • You do not need to do a split into train+test here, as that split is done for model performance evaluation and CV does exactly the same thing (performance evaluation), just in a more robust way. This comment does not apply if you have a more complex scenario in mind and want to optimise hyperparameters or do other advanced procedures. – Mischa Lisovyi May 14 '18 at 12:06

2 Answers


If you want to do hyperparameter selection, look into RandomizedSearchCV or GridSearchCV. If you want to use the best model afterwards, call either of these with refit=True (the default) and then use best_estimator_.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

log_params = {'penalty': ['l1', 'l2'], 'C': [1E-7, 1E-6, 1E-5, 1E-4, 1E-3]}
clf = LogisticRegression(solver='liblinear')  # liblinear supports both 'l1' and 'l2'
search = RandomizedSearchCV(clf, scoring='average_precision', cv=10,
                            n_iter=10, param_distributions=log_params,
                            refit=True, n_jobs=-1)
search.fit(X_train, y_train)
clf = search.best_estimator_

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
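
To check how the tuned model generalises, you can then score the refitted estimator on the held-out test set (a minimal sketch, assuming the X_test/y_test split from the question):

# inspect the selected hyperparameters, then evaluate the refitted
# model once on the held-out test set from the question's split
print(search.best_params_)
print(clf.score(X_test, y_test))  # mean accuracy on unseen data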

Bert Kellerman

Your second example is right for doing the cross-validation. See the example here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics

The fitting is done inside the cross_val_score function; you don't need to fit the model yourself beforehand.

[Edited] If, besides the cross-validation, you also want a trained model, you can call model.fit() afterwards, as in the sketch below.
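
Put together, a minimal sketch of that pipeline (reusing dataset_1 and df1['label'] from the question) might look like:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# hold out a test set for the final check
X_train, X_test, y_train, y_test = train_test_split(
    dataset_1, df1['label'], test_size=0.25, random_state=4222)

model = MultinomialNB()

# cross-validated performance estimate; cross_val_score clones and
# fits the model on each of the 5 folds internally
scores = cross_val_score(model, X_train, y_train, cv=5)

# train the final model on the full training set, then evaluate it
# once on the held-out test set
model.fit(X_train, y_train)
print(model.score(X_test, y_test))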

markus-hinsche