
I am currently working on a very small dataset of about 25 samples (200 features), and I need to perform model selection and also obtain a reliable estimate of classification accuracy. I was planning to split the dataset into a training set (for 4-fold CV) and a test set (for testing on unseen data). The main problem is that the accuracy obtained on the test set is not reliable enough.

So, could performing the cross-validation and testing multiple times solve the problem?

I was planning to repeat this process multiple times in order to have better confidence in the classification accuracy. For instance: I would run one cross-validation plus testing, and the output would be one "best" model plus its accuracy on the test set. On the next run I would perform the same process, but the "best" model might not be the same. By repeating this process many times I would eventually end up with one predominant model, and the reported accuracy would be the average of the accuracies obtained with that model.
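A rough sketch of what I mean, assuming scikit-learn; the two candidate classifiers, the split sizes and the number of repetitions are only placeholders, not my real setup:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.randn(25, 200), rng.randint(0, 2, 25)  # stand-in for my real data

candidates = {"logreg": LogisticRegression(max_iter=1000), "svm": SVC()}
winners, test_scores = [], {}

for run in range(20):                        # repeat the whole CV + test procedure
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=5, stratify=y, random_state=run)

    # model selection with 4-fold CV on the training part only
    cv_means = {name: cross_val_score(m, X_tr, y_tr, cv=4).mean()
                for name, m in candidates.items()}
    best = max(cv_means, key=cv_means.get)
    winners.append(best)

    # accuracy of the selected model on the held-out test part
    acc = candidates[best].fit(X_tr, y_tr).score(X_te, y_te)
    test_scores.setdefault(best, []).append(acc)

predominant = Counter(winners).most_common(1)[0][0]
print(predominant, np.mean(test_scores[predominant]))
```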

Since I have never heard of a testing framework like this one, does anyone have any suggestions or criticism of the proposed procedure?

Thanks in advance.

lcit

2 Answers


The algorithm seems interesting, but you would need to make many passes through the data and ensure that one specific model is truly dominant (that it surfaces in a real majority of the runs, not just 'more often than the others'). In general, the real problem in ML is having too little data. As anyone will tell you, it is not the team with the most complicated algorithm that wins, but the team with the biggest amount of data.

In your case I would also suggest one additional approach - bootstrapping. Details are here: what is the bootstrapped data in data mining?

Or it can be googled. Long story short, it is sampling with replacement, which should help you expand your dataset from 25 samples to something more interesting.
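As a rough illustration (a sketch with scikit-learn; the classifier and the number of bootstrap rounds are just placeholders, and the rows left out of each resample serve as the 'unseen' data for that round):

```python
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = rng.randn(25, 200), rng.randint(0, 2, 25)  # placeholder data

scores = []
for b in range(200):                               # bootstrap rounds
    # draw 25 rows with replacement; roughly 1/3 of the rows are left out
    idx = resample(np.arange(len(X)), replace=True, random_state=b)
    oob = np.setdiff1d(np.arange(len(X)), idx)     # out-of-bag rows
    if len(oob) == 0 or len(np.unique(y[idx])) < 2:
        continue                                   # skip degenerate resamples
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))

print(np.mean(scores), np.std(scores))             # average accuracy and spread
```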

Maksim Khaitovich
  • Thank you for your answer. Yes, the selected model must be predominant by a large margin. Concerning the bootstrapping, sampling with replacement could easily be used instead of the cross-validation (the two processes are very similar), but the test set must be set aside beforehand in order to have unseen data. The problem of having a small test set remains. – lcit Jul 20 '15 at 18:41
  • I would just try bootstrapping the full dataset and not have any test set at all. You have too little data to have all 3 sets, so you'd probably better get as much out of your data as you can with sampling and just pray that your model works fine on real test data (which in the end you will get somewhere). Also you may want to ask on http://stats.stackexchange.com/, maybe the folks there will have better ideas. – Maksim Khaitovich Jul 20 '15 at 19:49

When the data is small like yours you should consider LOOCV, or leave-one-out cross-validation. In this case you partition the data into 25 different folds, and in each one a single different observation is held out. Performance is then calculated using the 25 individual held-out predictions.

This will allow you to use the most data in your modeling and you will still have a good measure of performance.
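A minimal sketch with scikit-learn (the classifier is just a placeholder; the point is that every one of the 25 observations gets a held-out prediction):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = rng.randn(25, 200), rng.randint(0, 2, 25)  # placeholder data

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

# accuracy from the 25 individual held-out predictions
print("LOOCV accuracy:", (preds == y).mean())
```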

invoketheshell
  • LOOCV usually has a higher variance compared to k-fold, and it is easy to overfit. This means that the cross-validation score is biased and not reliable enough when we perform model selection or feature selection. The test set is used specifically to get a reliable score, but with a small dataset the test set does not represent the true population; for this reason I was planning to perform multiple tests, at the risk of ending up with a biased result anyway. The best solution I have found so far is to avoid model selection and use the cross-validation result. – lcit Jul 22 '15 at 11:50
  • In a famous paper, Shao (1993) showed that leave-one-out cross validation does not lead to a consistent estimate of the model. That is, if there is a true model, then LOOCV will not always find it, even with very large sample sizes. In contrast, certain kinds of leave-k-out cross-validation, where k increases with n, will be consistent. Frankly, I don't consider this is a very important result as there is never a true model. In reality, every model is wrong, so consistency is not really an interesting property. ~ R Hyndman – invoketheshell Jul 22 '15 at 12:28