According to the DOC of scikit-learn
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch=‘2*n_jobs’)
X and y
X : array-like The data to fit. Can be for example a list, or an array.
y : array-like, optional, default: None The target variable to try to predict in the case of supervised learning.
I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset. In some of the notebooks from kaggle some people use the whole dataset and some others X_train and y_train.
To my knowledge, cross validation just evaluate the model and shows whether or not you overfit/underfit your data (it does not actually train the model). Then, in my view the most data you have the better will be the performance, so I would use the whole dataset.
What do you think?