Questions tagged [cross-validation]

Cross-Validation is a method of evaluating and comparing predictive systems in statistics and machine learning.

Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model.

In typical cross-validation, the training and validation sets cross over in successive rounds so that each data point gets a chance to appear in the validation set. The basic form of cross-validation is k-fold cross-validation.

Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation.
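For concreteness, a minimal sketch of 5-fold cross-validation with scikit-learn (the library most questions under this tag use); the iris data and logistic regression model are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 rounds trains on 4 folds and validates on the held-out fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```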

2604 questions
14 votes • 3 answers

Classification report with Nested Cross Validation in SKlearn (Average/Individual values)

Is it possible to get a classification report from cross_val_score through some workaround? I'm using nested cross-validation, and while I can get various scores for a model this way, I would like to see the classification report of the outer loop. Any…
utengr • 3,225
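One common workaround (a sketch, not the only approach): run the outer loop with cross_val_predict so the pooled out-of-fold predictions can feed classification_report. Note this yields a single aggregate report rather than per-fold reports; the dataset and SVC grid below are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)  # inner loop: tuning
y_pred = cross_val_predict(inner, X, y, cv=5)            # outer loop: evaluation
print(classification_report(y, y_pred))
```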
14 votes • 3 answers

How to plot a learning curve for a keras experiment?

I'm training an RNN using keras and would like to see how the validation accuracy changes with the data set size. Keras has a list called val_acc in its history object, to which the respective validation set accuracy is appended after every epoch…
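A hedged sketch of plotting per-epoch validation accuracy from the History object; the toy dense network stands in for the question's RNN, and the history key is "val_acc" or "val_accuracy" depending on the Keras version. A curve over dataset size would instead require retraining on nested subsets:

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

# Toy data and model stand in for the question's RNN setup.
x = np.random.rand(200, 10)
y = (x.sum(axis=1) > 5).astype(int)

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(x, y, validation_split=0.2, epochs=20, verbose=0)

# Older Keras stores "val_acc"; newer versions store "val_accuracy".
val_key = "val_acc" if "val_acc" in history.history else "val_accuracy"
plt.plot(history.history[val_key], label="validation accuracy")
plt.plot(history.history[val_key.replace("val_", "")], label="training accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```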
14 votes • 2 answers

Difference between using train_test_split and cross_val_score in sklearn.cross_validation

I have a matrix with 20 columns. The last column contains 0/1 labels. The link to the data is here. I am trying to run a random forest on the dataset using cross-validation. I use two methods of doing this: using…
evianpring • 3,316
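A sketch contrasting the two approaches under the question's setup (20 columns, 0/1 labels in the last); the synthetic matrix is a placeholder, and the modern module is sklearn.model_selection rather than the deprecated sklearn.cross_validation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)
data = rng.rand(500, 20)
X, y = data[:, :19], (data[:, 19] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# (1) Single hold-out split: one estimate from one random partition.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te)))

# (2) Cross-validation: k estimates, each point validated exactly once.
print(cross_val_score(clf, X, y, cv=5).mean())
```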
14 votes • 2 answers

Cross validation for glm() models

I'm trying to do 10-fold cross-validation for some glm models that I built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I provide the following…
Error404 • 6,959
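The question concerns R's boot::cv.glm; to keep this page's examples in one language, here is merely the analogous 10-fold CV for a logistic GLM in scikit-learn, not the cv.glm API itself (dataset is a placeholder):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
glm = LogisticRegression(max_iter=5000)  # logistic GLM analogue

# cv=10 mirrors cv.glm(..., K = 10); the mean error plays the role of delta.
scores = cross_val_score(glm, X, y, cv=10, scoring="accuracy")
print(1 - scores.mean())  # cross-validated error estimate
```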
13 votes • 1 answer

key error not in index while cross validation

I have applied SVM to my dataset. My dataset is multi-label, meaning each observation has more than one label. During KFold cross-validation it raises a "not in index" error. It shows the indices from 601 to 6007 as not in the index (I have 1...6008 data…
sariii • 2,020
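The likely cause (an assumption from the error message): KFold yields positional indices, while plain bracket indexing on a pandas object looks up labels. A sketch of the usual .iloc fix with placeholder data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

X = pd.DataFrame(np.random.rand(100, 4))
y = pd.Series(np.random.randint(0, 2, 100))

for train_idx, test_idx in KFold(n_splits=5).split(X):
    # .iloc selects by position; X[train_idx] would look up column labels
    # and raise "not in index" once positions and labels diverge.
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```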
13 votes • 1 answer

Do I use the same Tfidf vocabulary in k-fold cross_validation

I am doing text classification based on the TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. But what confuses me is whether it is necessary to…
lx.F • 131
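A hedged sketch of the leakage-safe answer: put the vectorizer in a Pipeline so the TF-IDF vocabulary is refit on the training folds of every round rather than shared across folds; the texts and the LinearSVC classifier are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["good movie", "bad movie", "great film", "terrible film"] * 50
labels = [1, 0, 1, 0] * 50

pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
# The vectorizer (and its vocabulary) is refit on the 4 training folds
# in every round; the held-out fold never shapes the vocabulary.
print(cross_val_score(pipe, texts, labels, cv=5).mean())
```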
13 votes • 3 answers

Scikit-learn, GroupKFold with shuffling groups?

I was using StratifiedKFold from scikit-learn, but now I also need to account for "groups". There is a nice function GroupKFold, but my data are very time-dependent. Similarly to the help examples, the week number is the grouping index. But each week should…
gugatr0n1c • 377
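GroupKFold has no shuffle option in older scikit-learn releases; one workaround (a hypothetical helper, not an official API) is to assign whole groups to folds in a random order yourself:

```python
import numpy as np

def shuffled_group_kfold(groups, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) with intact groups in shuffled fold order."""
    rng = np.random.RandomState(seed)
    uniq = np.unique(groups)
    rng.shuffle(uniq)                                   # shuffle group order
    fold_of_group = {g: i % n_splits for i, g in enumerate(uniq)}
    folds = np.array([fold_of_group[g] for g in groups])
    for k in range(n_splits):
        yield np.where(folds != k)[0], np.where(folds == k)[0]

groups = np.repeat(np.arange(10), 20)   # e.g. 10 weeks, 20 rows each
for train_idx, test_idx in shuffled_group_kfold(groups):
    pass  # fit and evaluate per fold here
```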
13 votes • 1 answer

How to nest LabelKFold?

I have a dataset with ~300 points and 32 distinct labels and I want to evaluate a LinearSVR model by plotting its learning curve using grid search and LabelKFold validation. The code I have looks like this: import numpy as np from sklearn import…
Alex • 759
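In modern scikit-learn, LabelKFold was renamed GroupKFold. A minimal sketch, assuming a LinearSVR grid search where groups= is forwarded through fit() to the splitter; the data and grid are placeholders:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import LinearSVR

X = np.random.rand(300, 5)
y = np.random.rand(300)
groups = np.random.randint(0, 32, 300)   # ~32 distinct labels, as in the question

grid = GridSearchCV(LinearSVR(max_iter=10000),
                    {"C": [0.1, 1.0, 10.0]},
                    cv=GroupKFold(n_splits=5))
grid.fit(X, y, groups=groups)             # groups reach the GroupKFold splitter
print(grid.best_params_)
```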
13 votes • 3 answers

How to Plot PR-Curve Over 10 folds of Cross Validation in Scikit-Learn

I'm running some supervised experiments for a binary prediction problem. I'm using 10-fold cross-validation to evaluate performance in terms of mean average precision (the average precision of each fold, summed and divided by the number of folds for cross…
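A hedged sketch of one way to do it: compute a precision-recall curve per fold, overlay the ten curves, and average the per-fold average precision; data and model are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)
aps = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    probs = clf.predict_proba(X[test_idx])[:, 1]
    prec, rec, _ = precision_recall_curve(y[test_idx], probs)
    plt.plot(rec, prec, alpha=0.3)        # one curve per fold
    aps.append(average_precision_score(y[test_idx], probs))

plt.xlabel("recall")
plt.ylabel("precision")
plt.title(f"mean AP over 10 folds: {np.mean(aps):.3f}")
plt.show()
```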
12 votes • 2 answers

Alternate different models in Pipeline for GridSearchCV

I want to build a Pipeline in sklearn and test different models using GridSearchCV. Just an example (please do not pay attention to which particular models are chosen): reg = LogisticRegression() proj1 = PCA(n_components=2) proj2 = MDS() proj3 =…
sooobus • 841
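One known pattern (a sketch, with PCA variants standing in for the question's MDS, which has no transform for unseen data): pipeline steps are themselves parameters, so a list of param-grid dicts can swap whole estimators:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("proj", PCA()), ("clf", LogisticRegression(max_iter=1000))])

# Each dict is its own sub-grid; the "proj" and "clf" entries replace
# whole pipeline steps, not just their hyperparameters.
param_grid = [
    {"proj": [PCA(n_components=2), PCA(n_components=3)],
     "clf": [LogisticRegression(max_iter=1000)]},
    {"proj": [PCA(n_components=2)],
     "clf": [SVC()], "clf__C": [0.1, 1, 10]},
]
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```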
12 votes • 1 answer

Validation and Testing accuracy widely different

I am currently working on a dataset on Kaggle. After training the model on the training data, I tested it on the validation data and got an accuracy of around 0.49. However, the same model gives an accuracy of 0.05 on the testing data. I am using…
12 votes • 2 answers

Scikit-Learn: Avoiding Data Leakage During Cross-Validation

I've just been reading up on k-fold cross-validation and have realized that I'm inadvertently leaking data with my current preprocessing setup. Usually, I have a train and test dataset. I do a bunch of data imputation and one-hot encoding on my…
anon_swe • 8,791
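The standard fix, sketched with placeholder data: move imputation and scaling inside a Pipeline so they are refit on the training folds only and never see the held-out fold:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
X[rng.rand(300, 5) < 0.1] = np.nan          # missing values to impute
y = rng.randint(0, 2, 300)

# Imputer and scaler are fit on the training folds of each round only,
# so no statistics from the validation fold leak into preprocessing.
pipe = make_pipeline(SimpleImputer(strategy="median"),
                     StandardScaler(),
                     LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())
```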
12 votes • 1 answer

K fold cross validation using keras

It seems that k-fold cross-validation for conv nets is not taken seriously due to the huge running time of the neural network. I have a small dataset and I am interested in doing k-fold cross-validation using the example given here. Is it possible?…
motiur • 1,640
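It is possible with a manual loop; a hedged sketch that rebuilds the network each fold so no weights carry over between rounds, with toy data standing in for the question's set:

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

x = np.random.rand(200, 10)
y = np.random.randint(0, 2, 200)

def build_model():
    m = keras.Sequential([keras.layers.Dense(16, activation="relu"),
                          keras.layers.Dense(1, activation="sigmoid")])
    m.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
    return m

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(x):
    model = build_model()                 # fresh weights every fold
    model.fit(x[train_idx], y[train_idx], epochs=10, verbose=0)
    scores.append(model.evaluate(x[test_idx], y[test_idx], verbose=0)[1])
print(np.mean(scores))
```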
12 votes • 1 answer

Evaluating Logistic regression with cross validation

I would like to use cross-validation to train/test my dataset and evaluate the performance of the logistic regression model on the entire dataset, not only on the test set (e.g. 25%). These concepts are totally new to me and I am not very sure if…
S.H • 137
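A hedged sketch using cross_val_predict, which yields one out-of-fold prediction per row so metrics cover the entire dataset rather than a single 25% hold-out; the dataset is a placeholder:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
y_pred = cross_val_predict(LogisticRegression(max_iter=5000), X, y, cv=4)

print(accuracy_score(y, y_pred))   # every sample scored exactly once
print(confusion_matrix(y, y_pred))
```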
12 votes • 1 answer

Spark K-fold Cross Validation

I’m having some trouble understanding Spark’s cross-validation. Every example I have seen uses it for parameter tuning, but I assumed that it would just do regular k-fold cross-validation as well? What I want to do is to perform k-fold cross…
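A hedged sketch of the usual trick: Spark's CrossValidator is built for tuning, but an empty parameter grid reduces it to plain k-fold evaluation of a single model; the toy DataFrame is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([float(i), float(i % 3)]), float(i % 2))
     for i in range(100)],
    ["features", "label"])

# An empty parameter grid means one candidate (the model as configured),
# so CrossValidator just performs 5-fold evaluation of that single model.
cv = CrossValidator(estimator=LogisticRegression(),
                    estimatorParamMaps=ParamGridBuilder().build(),
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=5)
model = cv.fit(df)
print(model.avgMetrics)   # averaged metric across the 5 folds
```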