
I see papers that use 10-fold cross-validation on data sets whose number of samples is not divisible by 10.

I couldn't find any case where they explained how they chose each subset.

My assumption is that they use some form of resampling, but if that were the case, a sample could appear in more than one subset and therefore bias the model.

Paper as example: http://www.biomedcentral.com/1471-2105/9/319

Would it be recommended to do the following:

  • Given a sample size of 86, take 8 samples as a holdout set.
  • Use the remaining samples to train.
  • Repeat 10 times.

Done this way, every sample appears in a training set, only 80 of the 86 samples are used as holdouts, and no sample appears in both the training and holdout set of the same split.
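For concreteness, here is a minimal sketch of the scheme I am describing (the variable names are mine, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    n_samples, holdout_size, n_repeats = 86, 8, 10

    # Shuffle once, then carve off disjoint holdout sets of 8 samples,
    # so no sample ever appears in two holdout sets (80 of 86 are used).
    indices = rng.permutation(n_samples)
    for i in range(n_repeats):
        holdout = indices[i * holdout_size:(i + 1) * holdout_size]
        train = np.setdiff1d(indices, holdout)
        # fit the model on `train`, evaluate on `holdout`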

Any insight would be appreciated.


1 Answer


You want the folds to have equal size, or as close to equal as possible.

To do this, if you have 86 samples and want to use 10-fold CV, the first 86 % 10 = 6 folds will have size 86 // 10 + 1 = 9 and the rest will have size 86 // 10 = 8:

6 * 9 = 54
4 * 8 = 32
----------
        86

In general, if you have n samples and n_folds folds, you want to do what scikit-learn does:

The first n % n_folds folds have size n // n_folds + 1, other folds have size n // n_folds.

Note: // stands for integer division.
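As a quick sanity check, here is a sketch (assuming n = 86 and 10 folds) that computes the fold sizes directly from that rule and compares them against scikit-learn's KFold:

    import numpy as np
    from sklearn.model_selection import KFold

    n, n_folds = 86, 10

    # The first n % n_folds folds get one extra sample.
    sizes = [n // n_folds + 1] * (n % n_folds) \
          + [n // n_folds] * (n_folds - n % n_folds)
    print(sizes, sum(sizes))  # [9, 9, 9, 9, 9, 9, 8, 8, 8, 8] 86

    # KFold distributes the samples the same way.
    X = np.zeros((n, 1))
    print([len(test) for _, test in KFold(n_splits=n_folds).split(X)])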

I'm not aware of a proper scientific reference for this, but it seems to be the convention. See this question and also this one for the same suggestions. At least two major machine learning libraries do it this way.
