Questions tagged [cross-validation]

Cross-Validation is a method of evaluating and comparing predictive systems in statistics and machine learning.

Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model.

In typical cross-validation, the training and validation sets cross over in successive rounds so that each data point gets a chance of being validated. The basic form of cross-validation is k-fold cross-validation.

Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation.
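A minimal sketch of the k-fold scheme described above, using scikit-learn (one library among many; the idea itself is library-agnostic and the data here is a throwaway array):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Across the 5 rounds, each sample lands in the validation set exactly once.
val_counts = np.zeros(10, dtype=int)
for train_idx, val_idx in kf.split(X):
    val_counts[val_idx] += 1

assert (val_counts == 1).all()
```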

2604 questions
1 vote, 1 answer

Stratified KFold on sparse(csr) feature matrix

I have a large sparse matrix (95000, 12000) containing the features of my model. I want to do a stratified K fold cross validation using Sklearn.cross_validation module in python. However, I haven't found a way of indexing a sparse matrix in…
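With the modern sklearn.model_selection API (the Sklearn.cross_validation module mentioned above is long deprecated), fold indices are plain integer arrays, and a CSR matrix accepts them directly as row indices. A sketch on synthetic data (shapes are made up, smaller than the asker's):

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random
from sklearn.model_selection import StratifiedKFold

X = csr_matrix(sparse_random(100, 50, density=0.05, random_state=0))
y = np.array([0, 1] * 50)  # binary labels for stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Row-indexing a CSR matrix with an integer array works directly.
    X_train, X_test = X[train_idx], X[test_idx]
    assert X_train.shape == (80, 50) and X_test.shape == (20, 50)
```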
1 vote, 1 answer

Calculating AUC with Leave-One-Out cross-validation in mlr?

This is a quick question, just to make sure I'm not doing this the dumb way. I want to use auc as my measure in mlr, and I'm also using LOO due to the small sample size. Of course, in the LOO cross validation scheme the test sample is always only…
catastrophic-failure
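Per-fold AUC is undefined when the held-out set is a single observation, so the usual workaround is to pool the leave-one-out predictions and compute one AUC over all of them. A scikit-learn sketch of that pooling idea (the model and data are invented; mlr can aggregate its predictions the same way):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=30, n_features=5, random_state=0)

# Each LOO round predicts one sample; pool all 30 held-out
# probabilities, then compute a single AUC over the pooled set.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=LeaveOneOut(), method='predict_proba')
auc = roc_auc_score(y, proba[:, 1])
assert 0.0 <= auc <= 1.0
```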
1 vote, 1 answer

Compute model efficiency in a cross validation leave one subject out mode in R

I have a dataframe df structure(list(x = c(49, 50, 51, 52, 53, 54, 55, 56, 1, 2, 3, 4, 5, 14, 15, 16, 17, 163, 164, 165, 153, 154, 72, 38, 39, 40, 23, 13, 14, 15, 5, 6, 74, 75, 76, 77, 78, 79, 80, 81, 82, 127, 128, 129, 130, 131,…
SimonB
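Leave-one-subject-out is the scheme scikit-learn calls LeaveOneGroupOut: one round per subject, training on the others. A sketch on synthetic data (the subject IDs, model, and RMSE metric are assumptions, not the asker's dataframe):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
subjects = np.repeat([1, 2, 3, 4], 10)   # 4 hypothetical subjects, 10 rows each
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=40)

# One round per subject: fit on the other subjects, score on the held-out one.
logo = LeaveOneGroupOut()
rmse_per_subject = []
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmse_per_subject.append(mean_squared_error(y[test_idx], pred) ** 0.5)

assert len(rmse_per_subject) == 4   # one efficiency score per subject
```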
1 vote, 2 answers

Compute Random Forest with a leave one ID out cross validation in R

I have a dataframe df dput(df) structure(list(ID = c(4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 8, 8, 8, 9, 9), Y = c(2268.14043972082, 2147.62290922552, 2269.1387550775, 2247.31983098201, 1903.39138268307, 2174.78291538358,…
SimonB
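The same grouped scheme combines directly with a random forest; a hedged sketch with invented IDs and data rather than the asker's df:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
ids = np.repeat([4, 5, 6, 8, 9], 8)   # hypothetical IDs, 8 rows each
X = rng.normal(size=(40, 3))
y = X[:, 0] * 3 + rng.normal(size=40)

# groups=ids makes each CV round hold out every row belonging to one ID.
scores = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                         X, y, groups=ids, cv=LeaveOneGroupOut(),
                         scoring='neg_mean_squared_error')
assert len(scores) == 5   # one score per held-out ID
```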
1 vote, 0 answers

cv.glmnet Ridge Regression lambda.min = lambda.1se?

I'm currently running a ridge regression in R using the glmnet package; however, I recently ran into a new problem and was hoping for some help in interpreting my results. My data can be found here:…
dwm8
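For context on what equality of the two would mean: lambda.1se is defined as the largest λ whose CV error is within one standard error of the minimum, so the two coincide when the minimum sits at the boundary of the λ path or the curve is flat around it. The rule itself is easy to reproduce (a Python sketch on a made-up CV curve, not the asker's data):

```python
import numpy as np

# Hypothetical CV curve: one mean error and one standard error per lambda,
# lambdas sorted from largest to smallest as glmnet reports them.
lambdas = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
cv_mean = np.array([4.0, 3.0, 2.2, 2.0, 2.3])
cv_se   = np.array([0.3, 0.3, 0.3, 0.3, 0.3])

i_min = np.argmin(cv_mean)
lambda_min = lambdas[i_min]

# 1-SE rule: largest lambda whose mean error is within one SE of the minimum.
threshold = cv_mean[i_min] + cv_se[i_min]
lambda_1se = lambdas[np.where(cv_mean <= threshold)[0][0]]

assert lambda_min == 1.0
assert lambda_1se == 2.0
```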
1 vote, 1 answer

What is the meaning of the GridSearchCV best_score_ attribute? (the value is different from the mean of the cross validation array)

I'm confused by the results; probably I'm not getting the concept of cross-validation and GridSearch right. I followed the logic behind this post:…
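For reference, best_score_ is documented as the mean cross-validated score of the best parameter setting (not a score on refit data), which can be checked directly against cv_results_ (toy grid and data below, not the asker's):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
gs = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
gs.fit(X, y)

# best_score_ equals the mean over folds at best_index_.
mean_best = gs.cv_results_['mean_test_score'][gs.best_index_]
assert np.isclose(gs.best_score_, mean_best)
```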
1 vote, 1 answer

10-fold cross-validation with a sample size that is not divisible by 10

I see papers that use 10-fold cross validation on data sets that have a number of samples indivisible by 10. I couldn't find any case where they explained how they chose each subset. My assumption is that they use resampling to some extent, but if…
zacdav
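No resampling is needed: standard k-fold splitters simply spread the remainder over the first n mod k folds, so fold sizes differ by at most one. Scikit-learn's KFold shows this directly (95 is a stand-in sample size):

```python
import numpy as np
from sklearn.model_selection import KFold

n = 95                              # not divisible by 10
kf = KFold(n_splits=10)
sizes = [len(test) for _, test in kf.split(np.zeros(n))]

# The first n % k folds get one extra sample each.
assert sorted(sizes) == [9] * 5 + [10] * 5
assert sum(sizes) == n
```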
1 vote, 1 answer

SciKit Learn feature selection and cross validation using RFECV

I am still very new to machine learning and trying to figure things out myself. I am using SciKit learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I achieved a precision, recall and f1 score of around 79%.…
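A self-contained RFECV sketch on a small synthetic stand-in for the 20,000-feature tweet matrix (shapes, model, and fold count are assumptions): RFECV recursively drops the weakest features and uses cross-validation to decide how many to keep.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Small synthetic stand-in for the real tweet feature matrix.
X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

# step=1: eliminate one feature per iteration; cv=5 scores each subset.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)

assert 1 <= selector.n_features_ <= 25
assert selector.support_.sum() == selector.n_features_
```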
1 vote, 0 answers

cross validation matlab toolbox issue

Labels=[1; 0]; k=5; groups = Labels; cvFolds = crossvalind('Kfold', groups, k); I am getting an error that the Bioinformatics Toolbox is missing. Is there a way I could rewrite this function without using crossvalind?
shr m
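crossvalind('Kfold', …) essentially assigns each observation a random fold label, which takes two lines in any language (this sketch is Python and ignores crossvalind's stratified variants; the same idea translates to MATLAB with randperm and mod):

```python
import numpy as np

def kfold_indices(n, k, rng=None):
    """Random fold label (1..k) per observation, like crossvalind('Kfold', ...)."""
    rng = np.random.default_rng(rng)
    labels = np.resize(np.arange(1, k + 1), n)   # 1,2,...,k,1,2,... to length n
    return rng.permutation(labels)

folds = kfold_indices(10, 5, rng=0)
counts = np.bincount(folds)[1:]
assert counts.tolist() == [2, 2, 2, 2, 2]        # balanced folds
```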
1 vote, 1 answer

WEKA cross validation discretization

I'm trying to improve the accuracy of my WEKA model by applying an unsupervised discretize filter. I need to decide on the number of bins and whether equal frequency binning should be used. Normally, I would optimize this using a training set.…
user3197231
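WEKA's Discretize filter has no direct Python twin, but the underlying recipe is general: treat bin count and binning strategy as hyperparameters inside a pipeline, so CV re-fits the discretization on each training fold without leakage. A scikit-learn sketch of that idea (data, model, and grid are all assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Discretization lives inside the pipeline, so each CV training fold
# learns its own bin edges; the grid tunes bins and strategy together.
pipe = Pipeline([
    ('disc', KBinsDiscretizer(encode='ordinal')),
    ('clf', MultinomialNB()),
])
grid = {'disc__n_bins': [3, 5, 10],
        'disc__strategy': ['uniform', 'quantile']}  # quantile = equal frequency
gs = GridSearchCV(pipe, grid, cv=5)
gs.fit(X, y)

assert set(gs.best_params_) == {'disc__n_bins', 'disc__strategy'}
```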
1 vote, 1 answer

How to precompute foldid with even observations per fold for glmnet

According to the glmnet vignette, a foldid can be set up by: foldid=sample(1:10,size=length(y),replace=TRUE) However, if you look at the number of observations in each of the folds: > table(foldid) foldid 1 2 3 4 5 6 7 8 9 10 10 12 8 7…
fumikos
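The uneven counts come from sampling each fold label independently; repeating the labels out to length n and shuffling guarantees the counts differ by at most one. In R that is foldid <- sample(rep(1:10, length.out = length(y))); the same idea in Python (n is a made-up sample size):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 97

# Repeat 1..10 to length n, then shuffle: balanced fold labels.
foldid = rng.permutation(np.resize(np.arange(1, 11), n))

counts = np.bincount(foldid)[1:]
assert counts.max() - counts.min() <= 1
assert counts.sum() == n
```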
1 vote, 1 answer

Is there a discrepancy between createMultiFolds behavior and the resampling summary of a caret object?

I encountered a strange issue using custom folds for the cross-validation with caret. A MWE (in which the use of createMultiFolds doesn't really make sense) library(caret) #version 6.0-47 data(iris) set.seed(1) train.idx <-…
JeromeLaurent
1 vote, 2 answers

Multiple cross-validation + testing on a small dataset to improve confidence

I am currently working on a very small dataset of about 25 samples (200 features) and I need to perform model selection and also have a reliable classification accuracy. I was planning to split the dataset in a training set (for a 4-fold CV) and a…
lcit
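With only ~25 samples, a common alternative to carving out a fixed test set is nested cross-validation: an inner loop selects the model, an outer loop estimates its accuracy, so no samples are permanently sacrificed. A scikit-learn sketch on synthetic data of roughly the question's shape (grid and model are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold

# Tiny dataset in the spirit of the question: 25 samples, 200 features.
X, y = make_classification(n_samples=25, n_features=200, n_informative=5,
                           random_state=0)

# Inner 4-fold CV tunes C; outer 5-fold CV scores the tuned model,
# giving an (almost) unbiased accuracy estimate on the full 25 samples.
inner = GridSearchCV(SVC(), {'C': [0.1, 1, 10]},
                     cv=StratifiedKFold(4, shuffle=True, random_state=0))
outer_scores = cross_val_score(inner, X, y,
                               cv=StratifiedKFold(5, shuffle=True, random_state=1))
assert len(outer_scores) == 5
```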
1 vote, 0 answers

Obtain ROC curve in cross-validation of Logistic Regression in MATLAB

I'm trying to calculate the ROC curve of a cross-validation. In particular, the parameter AUC (Area under the curve) and OPTROCPT (Optimal ROC Point). I think I can calculate them by averaging the AUC and the OptROCPt of each iteration, but I didn't get…
Frank
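A common alternative to averaging per-fold AUCs is to pool the held-out scores from every fold and compute a single ROC curve over all of them. A sketch of the pooled approach (in Python rather than MATLAB, with invented data, but the idea carries over):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Every sample is scored exactly once by a model that never saw it;
# one ROC curve and one AUC are then computed over the pooled scores.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=StratifiedKFold(5), method='predict_proba')[:, 1]
fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)
assert 0.0 <= auc <= 1.0
```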
1 vote, 0 answers

How to fit a GLM to a dataset estimating "only the post hoc values for the random effects"?

My goal is to implement a cross-validation procedure for linear mixed models. Let me start with what I want to do (which is described here), and already tell you that I get stuck at step 4. The goal: Fit a GLM to the data with one subject removed…
JBJ