10-fold cross-validation works by taking the training set of labeled data and dividing it into 10 equal-size subsets. Nine of the subsets are combined into a new training set and the remaining subset is used for validation/testing, i.e. the model is trained on 90% of the original training set and tested on the other 10%.
This is performed 10 times (the folds), iterating over the 10 subsets so that each subset is used exactly once for testing. A performance measure is computed on each iteration, and after all 10 iterations are complete, the results are averaged.
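The procedure above can be sketched in pure Python. This is a minimal illustration, not a production implementation: a toy "model" that always predicts the majority training label stands in for a real learner, and accuracy stands in for whatever performance measure you use.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k equal-size subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold_size = n // k
    return [idx[i * fold_size:(i + 1) * fold_size] for i in range(k)]

def cross_validate(X, y, k=10):
    """Run k iterations; in each, one subset is held out for testing
    and the remaining k-1 are combined into the training set."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        # Toy "model": predict the majority label seen in the training set.
        train_labels = [y[j] for j in train_idx]
        majority = max(set(train_labels), key=train_labels.count)
        # Performance measure for this iteration: accuracy on the held-out subset.
        correct = sum(1 for j in test_idx if y[j] == majority)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)  # average over the k iterations

# 100 records, 70% labeled 1: the majority model averages 0.7 across the folds.
X = list(range(100))
y = [1] * 70 + [0] * 30
print(round(cross_validate(X, y, k=10), 2))
```

Note that every record lands in exactly one test subset across the 10 iterations, which is why the averaged score uses the whole original training set.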
There is nothing called a "training fold" or "testing fold"; a fold is one iteration of the process. There are also no subsets held aside during the process: all subsets are used in every iteration, nine for training and one for testing.
To create the learning curve you are talking about, you can simply vary the size of the original training set and let the 10-fold cross-validation process run as is. The number of records in the original training set is your measure of training-set size, and the performance is the average reported when cross-validation completes.
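As a sketch of that recipe, the loop below subsamples the data at a few hypothetical sizes and runs 10-fold cross-validation on each subsample; each (size, average score) pair is one point on the learning curve. Again a toy majority-label "model" stands in for your real learner, and the sizes 50–200 are just for illustration.

```python
import random

def cv_score(y, k=10):
    """Average accuracy of a majority-label model under k-fold CV
    (labels only, as a stand-in for training a real model)."""
    idx = list(range(len(y)))
    random.Random(0).shuffle(idx)
    size = len(idx) // k
    folds = [idx[i * size:(i + 1) * size] for i in range(k)]
    scores = []
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        test = folds[i]
        labels = [y[j] for j in train]
        majority = max(set(labels), key=labels.count)
        scores.append(sum(y[j] == majority for j in test) / len(test))
    return sum(scores) / k

# Learning curve: x-axis = original training-set size, y-axis = CV average.
y_full = [1] * 140 + [0] * 60   # hypothetical labeled data
random.Random(1).shuffle(y_full)
curve = [(n, cv_score(y_full[:n])) for n in (50, 100, 150, 200)]
for n, score in curve:
    print(n, round(score, 2))
```

Plotting `curve` gives performance as a function of how much labeled data the whole cross-validation procedure was given.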