Cross validation: cross_val_score function from scikit-learn arguments

Question

According to the scikit-learn documentation:

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

X and y

X : array-like The data to fit. Can be for example a list, or an array.

y : array-like, optional, default: None The target variable to try to predict in the case of supervised learning.

I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset. In some Kaggle notebooks people pass the whole dataset, while in others they pass X_train and y_train.

To my knowledge, cross-validation just evaluates the model and shows whether or not you overfit/underfit your data (it does not actually train the model). So, in my view, the more data you have, the better the performance will be, and I would use the whole dataset.

What do you think?

It is up to you. In some cases, people do their whole data analysis - including cross validation - on the train set, and only in the end use the test set. — Ami Tavory, May 04 '18 at 14:07
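The workflow described in this comment can be sketched roughly as follows. This is a minimal illustration, not a recommendation from the thread: the synthetic data from `make_regression`, the `LinearRegression` model, and the 80/20 split are all assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Set aside a test set that cross-validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()

# Cross-validate on the training portion only: one score per fold.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# cross_val_score fits clones of the estimator, so `model` itself is
# still unfitted here and has to be fitted explicitly.
model.fit(X_train, y_train)

# Touch the test set only once, at the very end.
test_score = model.score(X_test, y_test)

print("CV scores:", cv_scores)
print("Test score:", test_score)
```

Passing the whole dataset instead would simply mean calling `cross_val_score(model, X, y, cv=5)` and skipping the split.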

Mihai Alexandru-Ionut · Accepted Answer · 2018-05-04T14:26:07.297

Model performance depends on how the data is split, and a model that looks good on one split may still fail to generalize.

That is why we need cross-validation.

Cross-validation is a vital step in evaluating a model. It makes the most of the available data: over the course of the procedure, every sample is used for training in some folds and for testing in exactly one fold.

I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset.

[X, y] should be the whole dataset, because cross-validation internally splits the data into training data and test data.

Suppose you use cross validation with 5 folds (cv = 5).

We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set and compute the metric of interest.

Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set and compute the metric of interest. We repeat this for each of the remaining folds, which gives five scores in total.
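The fold-by-fold procedure can be written out by hand and compared with `cross_val_score`. The sketch below assumes a regressor (`LinearRegression`) and synthetic data, both chosen only for illustration. For a regressor, an integer `cv` corresponds to an unshuffled `KFold`, so the two sets of scores should match.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
model = LinearRegression()

manual_scores = []
# KFold yields (train indices, test indices) for each of the 5 folds.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    fold_model = clone(model)                    # fresh, unfitted copy
    fold_model.fit(X[train_idx], y[train_idx])   # fit on the other 4 folds
    # Score on the held-out fold (R^2 for regressors).
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# The same procedure in a single call.
library_scores = cross_val_score(model, X, y, cv=5)

print(np.allclose(manual_scores, library_scores))
```

For classifiers, an integer `cv` uses stratified folds instead, so a plain `KFold` loop would not necessarily reproduce the same numbers.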

By default, scikit-learn's cross_val_score() function uses the estimator's own score() method, which for regression is the R^2 score.

The R^2 score is also called the coefficient of determination.

`cross_val_score` uses the `score()` method of supplied estimator as default. And for regressor estimators, `score()` calculates the R_squared value. Hence `cross_val_score()` gives out this. — Vivek Kumar, May 05 '18 at 06:16
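A quick way to check this behaviour is to compare the default with an explicit `scoring="r2"`, and to see how a different metric is requested. The data and the `LinearRegression` model below are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
model = LinearRegression()

# scoring=None falls back to the estimator's score() method (R^2 here).
default_scores = cross_val_score(model, X, y, cv=5)

# Requesting R^2 explicitly should give the same values.
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# A different metric: negated MSE, so that higher is still better.
neg_mse_scores = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_squared_error"
)

print(np.allclose(default_scores, r2_scores))
print(neg_mse_scores)
```

Error metrics are exposed in negated form (`neg_mean_squared_error`) because scikit-learn scorers follow a "greater is better" convention.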
