
Let's say I'm creating a learning curve like this (there may be small errors in the code; it's just a sample). What I want is a classical learning curve, where you enlarge the training set while keeping the validation/test set the same size.

learningCurve <- generateLearningCurveData(
  learners   = "regr.glmnet",
  task       = bh.task,
  resampling = makeResampleDesc(method = "cv", iters = 5, predict = "both"),
  percs      = seq(0.1, 1, by = 0.1),
  # bh.task is a regression task, so use mse rather than auc
  measures   = list(setAggregation(mse, train.mean), setAggregation(mse, test.mean))
)

The problem with the code above is that the learners are indeed trained on a fraction of the training data, but the `train.mean`-aggregated measure is evaluated on the whole training set, so this is not really the learning curve I want. I would like the measure to be evaluated on the fraction of the training set that was actually used for learning, as here:

http://www.astroml.org/sklearn_tutorial/practical.html#learning-curves

I believe this sentence explains it all:

> Note that when we train on a small subset of the training data, the training error is computed using this subset, not the full training set.

How to achieve this?
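(In the meantime, one workaround is to build the curve by hand, so the training error is computed on exactly the rows used for fitting. This is only a sketch, assuming mlr's `train`/`predict`/`performance` API and the `bh.task` regression example with `mse`; for a smoother curve you would repeat and average over several random subsets per fraction.)

```r
library(mlr)
set.seed(1)

n <- getTaskSize(bh.task)
test.ids <- sample(n, size = round(0.2 * n))  # fixed holdout test set
train.pool <- setdiff(seq_len(n), test.ids)

fracs <- seq(0.1, 1, by = 0.1)
curve <- do.call(rbind, lapply(fracs, function(p) {
  # grow the training subset, keep the test set fixed
  ids <- sample(train.pool, size = round(p * length(train.pool)))
  mod <- train("regr.glmnet", bh.task, subset = ids)
  data.frame(
    frac      = p,
    # train error on the subset actually used for fitting
    mse.train = performance(predict(mod, task = bh.task, subset = ids), mse),
    # test error on the fixed holdout
    mse.test  = performance(predict(mod, task = bh.task, subset = test.ids), mse)
  )
}))
```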

Matek
  • `train.mean` should give you the performance on the training data that you're looking for, see https://mlr-org.github.io/mlr-tutorial/devel/html/learning_curve/index.html. Are you getting numbers that don't make sense? – Lars Kotthoff Nov 23 '16 at 17:53
  • Yeah, I've seen that page and I'm using it extensively. I'm not saying the results are not meaningful - they indeed are, but they are not what I am looking for. The thing is that when you train on 10% of the training data, `train.mean` still measures the performance on 100% of the training data (I checked). The result is that both the "train error" curve and the "test error" curve go down as the sample increases, whereas in classical "learning curves" the train error most often increases, as in the scikit link I provided. Not sure if this is clear. – Matek Nov 24 '16 at 08:47
  • My reading of the code is that it happens as you describe it should. Do you have a direct comparison between mlr and scikit-learn that shows that this is not the case? – Lars Kotthoff Nov 24 '16 at 18:18
  • It's too long to put in comment. Check these two codes if you can. The results are arguably similar, but I believe the point is obvious. Mlr trains on whole training data, whereas scikit trains on the subset of training data (which is what I am trying to achieve). [Mlr code](http://pastebin.com/4Js3jd99) [Scikit code](http://pastebin.com/F3z3FnBc) – Matek Nov 24 '16 at 22:02
  • 1
    Thanks, that helps. I don't have time to look into this at the moment, but I've opened an issue: https://github.com/mlr-org/mlr/issues/1357 – Lars Kotthoff Nov 25 '16 at 19:13
  • That's great, thanks! – Matek Nov 25 '16 at 19:16

2 Answers


The fix for this issue is in this pull request, which should be merged soon.

With the fix in place, I get the following learning curve for the full example in the comments:

[plot: learning curve for the full example, with train and test performance converging as the training fraction grows]
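For completeness, a sketch of how the curve can be regenerated once the patched mlr is installed (an assumption: install from the mlr GitHub repository, e.g. via `devtools::install_github`, until the fix reaches CRAN). It reuses the question's `bh.task` example, with `mse` as the regression measure:

```r
library(mlr)

# Same setup as in the question; with the fix, train.mean is now
# computed on the training subset actually used for fitting.
lc <- generateLearningCurveData(
  learners   = "regr.glmnet",
  task       = bh.task,
  resampling = makeResampleDesc("cv", iters = 5, predict = "both"),
  percs      = seq(0.1, 1, by = 0.1),
  measures   = list(setAggregation(mse, train.mean), setAggregation(mse, test.mean))
)
plotLearningCurve(lc)
```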

Lars Kotthoff

As a reference for future readers: this will be fixed; here is the GitHub issue:

https://github.com/mlr-org/mlr/issues/1357

Matek