Questions tagged [cross-validation]

Cross-Validation is a method of evaluating and comparing predictive systems in statistics and machine learning.

Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model.

In typical cross-validation, the training and validation sets cross over in successive rounds so that every data point is eventually used for validation. The basic form of cross-validation is k-fold cross-validation.

Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation.
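For readers new to the tag, here is a minimal k-fold sketch in Python with scikit-learn (the dataset and model are placeholders chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# With cv=5, each of the 5 folds serves once as the validation set
# while the remaining 4 folds are used for training.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```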

2604 questions
17
votes
3 answers

Grid Search and Early Stopping Using Cross Validation with XGBoost in SciKit-Learn

I am fairly new to scikit-learn and have been trying to hyperparameter-tune XGBoost. My aim is to use grid search to tune the model parameters and early stopping to control the number of trees and avoid overfitting. As I am…
George
  • 674
  • 2
  • 7
  • 19
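One common pattern for this, as a sketch rather than a definitive answer (the early-stopping API has moved between XGBoost versions, and reusing one fixed eval_set across all CV fits is a known simplification):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
# Hold out a fixed validation set to drive early stopping.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# In recent XGBoost releases early_stopping_rounds is a constructor
# argument; in older ones it was a fit() argument. Adjust as needed.
model = xgb.XGBClassifier(n_estimators=1000, early_stopping_rounds=10,
                          eval_metric="logloss")
grid = GridSearchCV(model,
                    {"max_depth": [3, 5], "learning_rate": [0.05, 0.1]},
                    cv=3)
# fit params are forwarded to every CV fit; the same eval_set is reused
# in each fold here, which is the simplification noted above.
grid.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(grid.best_params_, grid.best_estimator_.best_iteration)
```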
17
votes
2 answers

How to customize sklearn cross validation iterator by indices?

Similar to "Custom cross validation split sklearn", I want to define my own splits for GridSearchCV, for which I need to customize the built-in cross-validation iterator. I want to pass my own set of train-test indices for cross validation to the…
tangy
  • 3,056
  • 2
  • 25
  • 42
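For context, GridSearchCV's cv parameter already accepts any iterable of (train_indices, test_indices) pairs, so predefined splits can be passed directly (the indices below are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10, random_state=0)

# Hand-written splits: each tuple is (train indices, test indices).
custom_cv = [
    (np.array([0, 1, 2, 3, 4]), np.array([5, 6, 7, 8, 9])),
    (np.array([5, 6, 7, 8, 9]), np.array([0, 1, 2, 3, 4])),
]
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.1, 1.0]}, cv=custom_cv)
grid.fit(X, y)
print(grid.best_params_)
```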
16
votes
1 answer

How to apply oversampling when doing Leave-One-Group-Out cross validation?

I am working on imbalanced data for classification, and previously I used the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the training data. However, this time I think I also need to use a Leave One Group Out (LOGO)…
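A sketch of one way to combine the two, assuming the imbalanced-learn package: wrapping SMOTE in an imblearn Pipeline applies the oversampling inside each training fold only, while LeaveOneGroupOut drives the splits (data and groups below are synthetic):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Imbalanced synthetic data (~10% minority) with 5 made-up groups.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1],
                           random_state=0)
groups = np.repeat(np.arange(5), 60)

# SMOTE runs only on the training portion of each LOGO split.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, groups=groups,
                         cv=LeaveOneGroupOut(), scoring="f1")
print(scores)
```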
16
votes
1 answer

How to standardize data with sklearn's cross_val_score()

Let's say I want to use a LinearSVC to perform k-fold cross-validation on a dataset. How would I perform standardization on the data? The best practice I have read is to build your standardization model on your training data, then apply this model to…
als5ev
  • 175
  • 1
  • 5
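The standard way to get exactly that behavior, sketched with placeholder data: put the scaler and the model in a Pipeline, so the scaler is fit on each training fold only and then applied to the corresponding validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)

# The pipeline re-fits StandardScaler inside every CV fold, so no
# information from the validation fold leaks into the scaling.
pipe = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
print(cross_val_score(pipe, X, y, cv=5))
```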
16
votes
2 answers

Cross validation with grid search returns worse results than default

I'm using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the "best" parameters for different techniques, yet many of these perform worse than the defaults. I include the…
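One frequent cause, sketched below without claiming it explains this exact case: if the searched grid does not contain the library defaults, the best parameters within the grid can still lose to them. Including the defaults explicitly guards against that (the grid values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "max_depth": [3, 10, None],       # None is the sklearn default
    "min_samples_split": [2, 5, 10],  # 2 is the sklearn default
}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```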
16
votes
3 answers

What is the difference between cross_val_score with scoring='roc_auc' and roc_auc_score?

I am confused about the difference between the cross_val_score scoring metric 'roc_auc' and the roc_auc_score that I can just import and call directly. The documentation…
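For illustration, a side-by-side sketch on synthetic data: the 'roc_auc' scorer in cross_val_score is computed per held-out fold from continuous scores, whereas calling roc_auc_score on hard class predictions answers a different question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

fold_aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
proba = cross_val_predict(model, X, y, cv=5,
                          method="predict_proba")[:, 1]
labels = cross_val_predict(model, X, y, cv=5)

print(fold_aucs.mean())         # mean of per-fold AUCs on scores
print(roc_auc_score(y, proba))  # pooled AUC on probabilities
print(roc_auc_score(y, labels)) # AUC on hard labels: usually lower
```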
16
votes
2 answers

Difference between glmnet() and cv.glmnet() in R?

I'm working on a project that would show the potential influence a group of events has on an outcome. I'm using the glmnet package, specifically its Poisson family. Here's my code: # de <- data imported from sql connection x <-…
Sean Branchaw
  • 597
  • 1
  • 5
  • 21
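The question is about R, but the conceptual split has a rough scikit-learn analogue (shown in Python, like this listing's other examples): glmnet() fits a whole regularization path, while cv.glmnet() additionally runs k-fold CV to choose the penalty strength. A sketch of the analogous pair:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, lasso_path

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

# Like glmnet(): coefficients along a regularization path, no CV.
alphas, coefs, _ = lasso_path(X, y)

# Like cv.glmnet(): the same path, plus 10-fold CV to pick the penalty.
cv_model = LassoCV(cv=10).fit(X, y)
print(cv_model.alpha_)
```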
16
votes
2 answers

I have many more than three elements in every class, but I get this error: "class cannot be less than k=3 in scikit-learn"

This is my target (y): target = [7,1,2,2,3,5,4, 1,3,1,4,4,6,6, 7,5,7,8,8,8,5, 3,3,6,2,7,7,1, 10,3,7,10,4,10, 2,2,2,7] I do not know why I get the error while executing: ... # Split the data set in two equal parts X_train, X_test,…
postgres
  • 2,242
  • 5
  • 34
  • 50
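A likely explanation, sketched on the target above: after the data is halved, some classes are left with fewer than 3 samples in the training split, which is exactly what a stratified 3-fold splitter rejects (the random_state below is arbitrary):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

target = [7,1,2,2,3,5,4,1,3,1,4,4,6,6,7,5,7,8,8,8,5,
          3,3,6,2,7,7,1,10,3,7,10,4,10,2,2,2,7]
X = list(range(len(target)))

# Split the data set in two equal parts, as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.5, random_state=0)

# Several classes (5, 6, 8, 10) have only 3 samples in total, so the
# training half is likely to keep fewer than 3 of them.
print(Counter(y_train))
```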
15
votes
2 answers

Trying to Understand FB Prophet Cross Validation

I have a dataset with 84 monthly sales figures (from 01/2013 to 12/2019) - just months, not days. Month 01 | Sale 1 Month 02 | Sale 2 Month 03 | Sale 3 .... | ... Month 84 | Sale 84 From the visualization it looks like the model fits very well…
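For orientation, a sketch of Prophet's rolling-origin cross-validation on a monthly series (the placeholder values and window sizes are illustrative; note that Prophet expresses initial, period, and horizon in days even for monthly data):

```python
import pandas as pd
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

# 84 monthly observations; the y values are placeholders.
df = pd.DataFrame({
    "ds": pd.date_range("2013-01-01", periods=84, freq="MS"),
    "y": range(84),
})
m = Prophet().fit(df)

# Train on the first ~5 years, then forecast 365 days ahead from
# cutoffs spaced 180 days apart.
df_cv = cross_validation(m, initial="1825 days", period="180 days",
                         horizon="365 days")
print(performance_metrics(df_cv)[["horizon", "rmse"]].head())
```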
15
votes
0 answers

Nested cross-validation example on Scikit-learn

I'm trying to wrap my head around the example of nested vs. non-nested CV in sklearn. I checked multiple answers but I am still confused by the example. To my knowledge, a nested CV aims to use a different subset of data to select the best…
NCL
  • 355
  • 2
  • 4
  • 12
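The pattern in question, as a minimal sketch (synthetic data, arbitrary grid): the inner GridSearchCV selects hyperparameters, and the outer cross_val_score estimates the generalization error of that whole selection procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: hyperparameter selection on each outer training split.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

# Outer loop: each held-out fold scores a model whose hyperparameters
# were chosen without ever seeing that fold.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```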
15
votes
2 answers

Sklearn: Cross validation for grouped data

I am trying to implement a cross-validation scheme on grouped data. I was hoping to use the GroupKFold method, but I keep getting an error. What am I doing wrong? The code (slightly different from the one I used, since I had different data, so I had a…
sw007sw
  • 161
  • 1
  • 5
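For reference, a minimal working GroupKFold sketch on toy arrays: groups must be passed to split(), one label per sample, and the number of distinct groups must be at least n_splits.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])   # 3 groups >= n_splits

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # No group ever appears in both the train and test side.
    print(train_idx, test_idx)
```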
15
votes
2 answers

Saving a cross-validation trained model in Scikit

I have trained a model in scikit-learn using cross-validation and a Naive Bayes classifier. How can I persist this model to later run against new instances? Here is what I have: I can get the CV scores, but I don't know how to access the…
Ali
  • 1,605
  • 1
  • 13
  • 19
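A sketch of the usual workflow (the file name is illustrative): cross_val_score returns only scores, not fitted models, so the estimator is refit on all the data and persisted with joblib, which is what the scikit-learn docs recommend for model persistence.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=0)
clf = GaussianNB()

print(cross_val_score(clf, X, y, cv=5))  # CV estimates performance only
clf.fit(X, y)                            # final model on all the data
joblib.dump(clf, "nb_model.joblib")      # reload via joblib.load(...)
```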
15
votes
2 answers

sklearn KFold: access a single fold instead of using a for loop

After using cross_validation.KFold(n, n_folds=folds), I would like to access the indexes for training and testing of a single fold, instead of going through all the folds. So let's take the example code: from sklearn import cross_validation X =…
NumesSanguis
  • 5,832
  • 6
  • 41
  • 76
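With the current API (the cross_validation module in the question has since been removed in favor of model_selection), a single fold is just one item of the split() generator:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5)

# Materialize all (train, test) index pairs, then pick one directly.
folds = list(kf.split(X))
train_idx, test_idx = folds[2]   # e.g. fold number 2, no loop needed
print(train_idx, test_idx)
```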
15
votes
1 answer

CARET. Relationship between data splitting and trainControl

I have carefully read the CARET documentation at http://caret.r-forge.r-project.org/training.html and the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still confused about the relationship between two…
Amelio Vazquez-Reina
  • 91,494
  • 132
  • 359
  • 564
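The question concerns caret, but the relationship it asks about is stack-independent: a createDataPartition-style split holds out a final test set, while trainControl-style resampling happens inside the training set. A rough scikit-learn analogue, in Python like this listing's other examples:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, random_state=0)

# Outer split, analogous to createDataPartition: a test set the
# resampling never touches.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Inner resampling, analogous to trainControl(method="cv", number=10),
# runs entirely within the training set.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.1, 1, 10]}, cv=10)
grid.fit(X_tr, y_tr)
print(grid.score(X_te, y_te))    # honest estimate on held-out data
```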
14
votes
1 answer

Why, when I use GridSearchCV with roc_auc scoring, is the score different for grid_search.score(X, y) and roc_auc_score(y, y_predict)?

I am using stratified 10-fold cross validation to find the model that predicts y (a binary outcome) from X (which has 34 labels) with the highest AUC. I set up the GridSearchCV: log_reg = LogisticRegression() parameter_grid = {'penalty' : ["l1", "l2"],'C':…
huda95x
  • 149
  • 1
  • 1
  • 5
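The usual explanation, sketched below on synthetic data: grid_search.score(X, y) applies the 'roc_auc' scorer to continuous decision scores, while roc_auc_score(y, grid.predict(X)) is computed on hard 0/1 labels, which typically gives a lower number.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.1, 1, 10]}, cv=10, scoring="roc_auc")
grid.fit(X, y)

print(grid.score(X, y))                   # scorer on continuous scores
print(roc_auc_score(y, grid.predict(X)))  # AUC on hard labels: differs
# Feeding probabilities reproduces what the scorer computes:
print(roc_auc_score(y, grid.predict_proba(X)[:, 1]))
```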