
I am currently working on a very small dataset of about 25 samples (200 features), and I need to perform model selection and also obtain a reliable estimate of classification accuracy. I was planning to split the dataset into a training set (for 4-fold CV) and a test set (for testing on unseen data). The main problem is that the accuracy obtained on the test set is not reliable enough.

So, could performing the cross-validation and testing multiple times solve the problem?

I was planning to repeat this process multiple times in order to have better confidence in the classification accuracy. For instance: I would run one cross-validation plus testing, and the output would be one "best" model plus its accuracy on the test set. On the next run I would perform the same process, but the "best" model might not be the same. By repeating this process many times I would eventually end up with one predominant model, and the reported accuracy would be the average of the accuracies obtained with that model.
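A rough sketch of what I mean, assuming scikit-learn; the two candidate classifiers, the split sizes and the number of repetitions are only placeholders, not my real setup:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.randn(25, 200), rng.randint(0, 2, 25)  # stand-in for my real data

candidates = {"logreg": LogisticRegression(max_iter=1000), "svm": SVC()}
winners, test_scores = [], {}

for run in range(20):                        # repeat the whole CV + test procedure
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=5, stratify=y, random_state=run)

    # model selection with 4-fold CV on the training part only
    cv_means = {name: cross_val_score(m, X_tr, y_tr, cv=4).mean()
                for name, m in candidates.items()}
    best = max(cv_means, key=cv_means.get)
    winners.append(best)

    # accuracy of the selected model on the held-out test part
    acc = candidates[best].fit(X_tr, y_tr).score(X_te, y_te)
    test_scores.setdefault(best, []).append(acc)

predominant = Counter(winners).most_common(1)[0][0]
print(predominant, np.mean(test_scores[predominant]))
```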

Since I have never heard of a testing framework like this one, does anyone have any suggestions or criticism of the proposed procedure?

Thanks in advance.

lcit

2 Answers


The algorithm seems interesting, but you would need to make many passes through the data and ensure that one specific model is truly dominant (that it surfaces in a real majority of the runs, not just 'more often than the others'). In general, the real problem in ML is having too little data. As anyone will tell you, it is not the team with the most complicated algorithm that wins, but the team with the biggest amount of data.

In your case I would also suggest one additional approach - bootstrapping. Details are here: what is the bootstrapped data in data mining?

Or it can be googled. Long story short, it is sampling with replacement, which should help you expand your dataset from 25 samples to something more interesting.
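As a rough illustration (a sketch with scikit-learn; the classifier and the number of bootstrap rounds are just placeholders, and the rows left out of each resample serve as the 'unseen' data for that round):

```python
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = rng.randn(25, 200), rng.randint(0, 2, 25)  # placeholder data

scores = []
for b in range(200):                               # bootstrap rounds
    # draw 25 rows with replacement; roughly 1/3 of the rows are left out
    idx = resample(np.arange(len(X)), replace=True, random_state=b)
    oob = np.setdiff1d(np.arange(len(X)), idx)     # out-of-bag rows
    if len(oob) == 0 or len(np.unique(y[idx])) < 2:
        continue                                   # skip degenerate resamples
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))

print(np.mean(scores), np.std(scores))             # average accuracy and spread
```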

Maksim Khaitovich
  • Thank you for your answer. Yes, the selected model must be predominant by a large margin. Concerning the bootstrapping, sampling with replacement could easily be used instead of the cross-validation (the two processes are very similar), but the test set must be set aside beforehand in order to have unseen data. The problem of having a small test set remains. – lcit Jul 20 '15 at 18:41
  • I would just try bootstrapping the full dataset and not have any test set at all. You have too little data to have all 3 sets, so you'd probably better get as much out of your data as you can with sampling and just pray that your model works fine on real test data (which in the end you will get somewhere). Also you may want to ask on http://stats.stackexchange.com/, maybe the folks there will have better ideas. – Maksim Khaitovich Jul 20 '15 at 19:49

When the data is small like yours you should consider LOOCV, or leave-one-out cross-validation. In this case you partition the data into 25 different folds, and in each one a single different observation is held out. Performance is then calculated using the 25 individual held-out predictions.

This will allow you to use the most data in your modeling and you will still have a good measure of performance.
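A minimal sketch with scikit-learn (the classifier is just a placeholder; the point is that every one of the 25 observations gets a held-out prediction):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = rng.randn(25, 200), rng.randint(0, 2, 25)  # placeholder data

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

# accuracy from the 25 individual held-out predictions
print("LOOCV accuracy:", (preds == y).mean())
```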

invoketheshell
  • LOOCV usually has a higher variance compared to k-fold, and it is easy to overfit. This means that the cross-validation score is biased and not reliable enough when we perform model selection or feature selection. The test set is used specifically to get a reliable score, but with a small dataset the test set does not represent the true population; for this reason I was planning to perform multiple tests, at the risk of ending up with a biased result anyway. The best solution I have found so far is to avoid model selection and use the cross-validation result. – lcit Jul 22 '15 at 11:50
  • In a famous paper, Shao (1993) showed that leave-one-out cross validation does not lead to a consistent estimate of the model. That is, if there is a true model, then LOOCV will not always find it, even with very large sample sizes. In contrast, certain kinds of leave-k-out cross-validation, where k increases with n, will be consistent. Frankly, I don't consider this is a very important result as there is never a true model. In reality, every model is wrong, so consistency is not really an interesting property. ~ R Hyndman – invoketheshell Jul 22 '15 at 12:28