Questions tagged [cross-validation]

Cross-Validation is a method of evaluating and comparing predictive systems in statistics and machine learning.

Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model.

In typical cross-validation, the training and validation sets cross over in successive rounds so that each data point gets a chance to be validated. The basic form of cross-validation is k-fold cross-validation.

Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation.
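As a concrete illustration of the k-fold scheme described above, a minimal sketch in scikit-learn (the dataset and model are arbitrary stand-ins): every sample lands in the validation fold exactly once.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    # 5 folds: each round trains on 4/5 of the data, validates on the rest.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0))
    print(scores.mean())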

2604 questions
12 votes · 1 answer

Caret Package: Stratified Cross Validation in Train Function

Is there a way to perform stratified cross-validation when using the train function to fit a model to a large imbalanced data set? I know straightforward k-fold cross-validation is possible, but my categories are highly unbalanced. I've seen…
Windstorm1981
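The question is about R's caret, but the technique is library-agnostic; as a sketch of the idea, stratified k-fold in scikit-learn on synthetic imbalanced data (all names and sizes below are stand-ins, not the asker's setup):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Assumed 95/5 class split, standing in for the asker's unbalanced set.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                               random_state=0)

    # StratifiedKFold keeps the class ratio roughly constant in every fold.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                             cv=cv, scoring="f1")
    print(scores.mean())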
12 votes · 2 answers

Creating folds for k-fold CV in R using Caret

I'm trying to set up k-fold CV for several classification methods/hyperparameters using the data available at http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data. This set is made of 208…
gcolucci
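caret::createFolds is the R-side tool here; for comparison, a sketch of generating reusable fold indices with scikit-learn (the data below is random, shaped like the Sonar set only for illustration):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(208, 60))    # shaped like Sonar (208 x 60), not the real data
    y = rng.integers(0, 2, size=208)  # placeholder labels

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    folds = list(cv.split(X, y))
    # `folds` can be passed unchanged as cv= to several models, so every
    # method/hyperparameter combination is scored on identical splits.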
11 votes · 1 answer

Cross-validation and parameters tuning with XGBoost and hyperopt

One way to do nested cross-validation with an XGB model would be:

    from sklearn.model_selection import GridSearchCV, cross_val_score
    from xgboost import XGBClassifier
    # Let's assume that we have some data for a binary classification
    # problem : X…
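Filling out the truncated idea as a hedged sketch (stand-in data; the grid values are arbitrary): an inner GridSearchCV tunes the booster, an outer cross_val_score estimates generalization, which is the usual shape of nested CV. (hyperopt could replace the inner grid search.)

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

    inner = GridSearchCV(XGBClassifier(),
                         param_grid={"max_depth": [2, 4],
                                     "n_estimators": [50, 100]},
                         cv=3)
    # Each outer fold refits the whole inner search, so the tuned model is
    # always scored on data it never saw during tuning.
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean())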
11 votes · 1 answer

Compare ways to tune hyperparameters in scikit-learn

This post is about the differences between LogisticRegressionCV, GridSearchCV and cross_val_score. Consider the following setup:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression,…
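A compact sketch of how the three tools relate on that setup (the grid of C values is an assumption): LogisticRegressionCV tunes C internally, GridSearchCV tunes any estimator, and cross_val_score only scores a fixed configuration.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = load_digits(return_X_y=True)
    Cs = np.logspace(-2, 2, 5)

    lrcv = LogisticRegressionCV(Cs=Cs, cv=5, max_iter=2000).fit(X, y)
    gs = GridSearchCV(LogisticRegression(max_iter=2000),
                      {"C": Cs}, cv=5).fit(X, y)
    fixed = cross_val_score(LogisticRegression(C=1.0, max_iter=2000),
                            X, y, cv=5)
    print(lrcv.C_, gs.best_params_, fixed.mean())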
11 votes · 2 answers

How to implement SMOTE in cross validation and GridSearchCV

I'm relatively new to Python. Can you help me turn my SMOTE implementation into a proper pipeline? What I want is to apply the over- and under-sampling on the training set of every k-fold iteration so that the model is trained on a balanced data…
MLearner
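The usual resolution, sketched here with assumed stand-in data: imbalanced-learn's Pipeline applies SMOTE only to the training portion of each fold, so GridSearchCV never leaks synthetic samples into the validation fold.

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                               random_state=0)

    # The sampler runs only on fit(), i.e. only on each training split.
    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("clf", LogisticRegression(max_iter=1000))])
    grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]},
                        cv=StratifiedKFold(5), scoring="f1")
    grid.fit(X, y)
    print(grid.best_params_)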
11 votes · 2 answers

Using cross validation and AUC-ROC for a logistic regression model in sklearn

I'm using the sklearn package to build a logistic regression model and then evaluate it. Specifically, I want to do so using cross-validation, but can't figure out the right way to do it with the cross_val_score function. According to the…
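For what the excerpt is after, a minimal sketch on synthetic data: scoring="roc_auc" makes cross_val_score rank by decision scores rather than hard labels.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)
    aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=5, scoring="roc_auc")
    print(aucs.mean())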
11 votes · 1 answer

scikit-learn: cross_val_predict only works for partitions

I am struggling to work out how to implement TimeSeriesSplit in sklearn. The suggested answer at the link below (sklearn TimeSeriesSplit cross_val_predict only works for partitions) yields the same ValueError. Here is the relevant bit from my code:

    from…
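The ValueError arises because cross_val_predict requires every sample to appear in exactly one test fold, which TimeSeriesSplit does not guarantee (the first training window is never tested). A sketch of the usual workaround, looping over the splits by hand on stand-in data:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import TimeSeriesSplit

    X, y = make_regression(n_samples=200, random_state=0)  # stand-in series
    preds, idx = [], []
    for train_i, test_i in TimeSeriesSplit(n_splits=5).split(X):
        model = Ridge().fit(X[train_i], y[train_i])
        preds.append(model.predict(X[test_i]))
        idx.append(test_i)
    # Out-of-sample predictions for every sample after the first window.
    preds, idx = np.concatenate(preds), np.concatenate(idx)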
11 votes · 1 answer

sklearn grid search with grouped K fold cv generator

I am trying to implement a grid search over parameters in sklearn using randomized search and a grouped k-fold cross-validation generator. The following…
Sam Weisenthal
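A sketch of the combination the question describes (synthetic data and an assumed group layout): the groups array is forwarded through fit(), not given to the splitter, so no two folds share a group.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, RandomizedSearchCV

    X, y = make_classification(n_samples=300, random_state=0)
    groups = np.repeat(np.arange(30), 10)   # assumed: 30 groups of 10 samples

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        {"max_depth": [2, 4, 8], "n_estimators": [50, 100]},
        n_iter=4, cv=GroupKFold(n_splits=5), random_state=0)
    search.fit(X, y, groups=groups)  # groups must be passed here
    print(search.best_params_)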
11 votes · 1 answer

Caret package - cross-validating GAM with both smooth and linear predictors

I would like to cross-validate a GAM model using caret. My GAM model has a binary outcome variable, an isotropic smooth of latitude and longitude coordinate pairs, and then linear predictors. Typical syntax when using mgcv is:

    gam1 <- gam( y ~ s(lat…
Paul Lantos
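caret/mgcv is R territory; as a rough Python-side analogue of the same model shape (using pygam, an assumption not taken from the question), a tensor smooth over lat/lon plus a linear term, cross-validated by hand on placeholder data:

    import numpy as np
    from pygam import LogisticGAM, te, l
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))     # columns: lat, lon, one linear predictor
    y = rng.integers(0, 2, size=300)  # placeholder binary outcome

    accs = []
    for tr, va in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
        # te(0, 1) ~ a 2-D smooth of lat/lon; l(2) ~ a linear term.
        gam = LogisticGAM(te(0, 1) + l(2)).fit(X[tr], y[tr])
        accs.append((gam.predict(X[va]) == y[va]).mean())
    print(np.mean(accs))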
11 votes · 1 answer

How to get classes labels from cross_val_predict used with predict_proba in scikit-learn

I need to train a Random Forest classifier using 3-fold cross-validation. For each sample, I need to retrieve the prediction probability when it happens to be in the test set. I am using scikit-learn version 0.18.dev0. This new version adds the…
gc5
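The answer the question is circling, shown as a sketch on synthetic data: with method="predict_proba", the columns of the returned matrix follow the sorted unique labels of y, the same order as each fold estimator's classes_.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                               random_state=0)
    proba = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                              cv=3, method="predict_proba")
    labels = np.unique(y)   # column i of `proba` is P(class == labels[i])
    print(labels, proba.shape)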
11 votes · 2 answers

Spark CrossValidatorModel access other models than the bestModel?

I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the…
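In Spark 1.6.1 (the asker's version) only bestModel survives training. Since Spark 2.3 the fold-level models can be retained via collectSubModels; a pyspark sketch, with a tiny made-up DataFrame standing in for the real pipeline:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    train_df = spark.createDataFrame(
        [(Vectors.dense([0.0, 1.0]), 0.0),
         (Vectors.dense([1.0, 0.0]), 1.0)] * 20,
        ["features", "label"])  # toy separable data, not the asker's set

    lr = LogisticRegression()
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3, collectSubModels=True)
    model = cv.fit(train_df)
    # model.subModels[fold][param_index] holds every fitted model,
    # alongside the usual model.bestModel.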
11 votes · 2 answers

StratifiedKFold vs StratifiedShuffleSplit vs StratifiedKFold + Shuffle

What is the difference between StratifiedKFold, StratifiedShuffleSplit, and StratifiedKFold + shuffle? When should I use each one? When do I get a better accuracy score? Why do I not get similar results? I have put my code and the results below. I am using…
Aizzaac
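The three splitters side by side on synthetic data (a sketch, not the asker's code): KFold partitions the data so each sample is tested exactly once, ShuffleSplit draws independent random splits so samples may repeat or be skipped, and shuffle=True only randomizes which samples land in which fold.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import (StratifiedKFold,
                                         StratifiedShuffleSplit,
                                         cross_val_score)

    X, y = make_classification(n_samples=500, weights=[0.8, 0.2],
                               random_state=0)
    clf = LogisticRegression(max_iter=1000)

    for cv in (StratifiedKFold(5),
               StratifiedKFold(5, shuffle=True, random_state=0),
               StratifiedShuffleSplit(n_splits=5, test_size=0.2,
                                      random_state=0)):
        print(type(cv).__name__, cross_val_score(clf, X, y, cv=cv).mean())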
11 votes · 2 answers

sklearn: User defined cross validation for time series data

I'm trying to solve a machine learning problem. I have a specific dataset with a time-series element. For this problem I'm using the well-known Python library sklearn. There are a lot of cross-validation iterators in this library, and there are also several…
Demyanov
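Any iterable of (train_indices, test_indices) pairs can serve as the cv= argument, so a user-defined time-aware splitter is just a generator. A sketch with an expanding window over assumed block sizes (all data below is random stand-in):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 300
    X, y = rng.normal(size=(n, 5)), rng.normal(size=n)

    def expanding_window(n_samples, n_splits=5, test_size=30):
        # Train on everything before each test block; never look forward.
        for k in range(n_splits):
            stop = n_samples - (n_splits - 1 - k) * test_size
            yield np.arange(stop - test_size), np.arange(stop - test_size, stop)

    scores = cross_val_score(Ridge(), X, y, cv=list(expanding_window(n)))
    print(scores)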
11 votes · 1 answer

Log transform dependent variable for regression tree

I have a dataset where I find that the dependent (target) variable has a skewed distribution, i.e. there are a few very large values and a long tail. When I run the regression tree, one end-node is created for the large-valued observations and one…
airjordan707
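One way to try the idea under cross-validation, sketched on synthetic skewed data: scikit-learn's TransformedTargetRegressor fits the tree on a log-transformed target (log1p here, a safe variant of plain log) and inverts at predict time, so CV scores stay on the original scale.

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = np.exp(X[:, 0] + rng.normal(scale=0.5, size=500))  # long-tailed target

    model = TransformedTargetRegressor(
        regressor=DecisionTreeRegressor(max_depth=4),
        func=np.log1p, inverse_func=np.expm1)
    print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())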
11 votes · 3 answers

R: Cross validation on a dataset with factors

Often I want to run cross-validation on a dataset that contains some factor variables, and after running for a while the routine fails with the error: factor x has new levels Y. For example, using package boot:

    library(boot)
    d…
musically_ut
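The R error comes from factor levels that appear only in a validation fold. The scikit-learn analogue of the usual fix, sketched on made-up categorical data: one-hot encode inside the pipeline with handle_unknown="ignore", so categories unseen during training don't break prediction.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.default_rng(0)
    X = rng.choice(list("abcdefgh"), size=(200, 2))  # categorical features
    y = rng.integers(0, 2, size=200)

    pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                         LogisticRegression())
    print(cross_val_score(pipe, X, y, cv=5).mean())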