
I have a classification task that I managed to train with the mlr package using LDA ("classif.lda") in a few seconds. However, when I trained it using "classif.rpart", the training never finished.

Is there any different setup to be done for the different methods?

My training data is here if needed to replicate the problem. I tried to train it simply with:

```r
library(mlr)

pred.bin.task <- makeClassifTask(id = "CountyCrime", data = dftrain, target = "count.bins")
mod <- train("classif.rpart", pred.bin.task)
```
Ricky

1 Answer


In general, you don't need to change anything about the setup when switching learners -- one of the main points of mlr is to make this easy! This does not mean that it'll always work though, as different learning methods do different things under the hood.

It looks like in this particular case the model simply takes a long time to train, so you probably didn't wait long enough for it to complete. You have quite a large data frame.
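One quick way to check whether the slowness is in the learner itself rather than in mlr is to time the underlying `rpart` call directly. A minimal sketch, assuming `dftrain` is loaded as in the question:

```r
library(rpart)

# Fit the same model outside mlr and time it; if this call is also slow,
# the bottleneck is rpart itself, not the mlr wrapper.
system.time(
  fit <- rpart(count.bins ~ ., data = dftrain)
)
```

(As the comments below confirm, running `rpart` directly on this data is indeed just as slow.)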

Looking at your data, you seem to have an interval of values in count.bins. This is treated as a factor by R (i.e. intervals are only the same if the string matches completely), which is probably not what you want here. You could encode start and end as separate (numerical) features.
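A minimal sketch of that encoding, assuming the interval labels look something like "(0,50]" (the exact format in your data may differ, so the regex is only illustrative):

```r
# Pull the numeric endpoints out of interval labels such as "(0,50]".
bins  <- as.character(dftrain$count.bins)
parts <- regmatches(bins, gregexpr("-?[0-9.]+", bins))

dftrain$bin.start <- as.numeric(sapply(parts, `[`, 1))
dftrain$bin.end   <- as.numeric(sapply(parts, `[`, 2))
```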

Lars Kotthoff
  • Thanks Lars for the response. I don't think it's because I didn't wait long enough: as I mentioned, `lda` finishes in a few seconds, and I waited an hour for `rpart`. I ran a similar dataset (larger; the one I put here is a subset) through `caret`, where `rpart` finished a few seconds after `lda`, so I didn't expect such a big time difference in mlr. `lda` on mlr was faster than `lda` on `caret`. – Ricky Feb 01 '16 at 01:44
  • Hmm, interesting. I've tried to run `rpart` directly and that also takes a long time. How exactly did you run caret? I've tried it with default arguments and it indeed finished very quickly, but it internally subsamples the data, so the models are trained on smaller subsets. – Lars Kotthoff Feb 01 '16 at 01:58
  • Very interesting. I didn't try training `rpart` directly; now that I have, yes, it takes a long time (I stopped it), but it was still fast in `caret`. I trained with `caret::train(count.bins ~ ., data = dftrain, method = "rpart", trControl = fitControl)` where `fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, returnData = FALSE, verboseIter = TRUE)` – Ricky Feb 01 '16 at 08:57
  • Right, so that would train on a subset of the data only. In `mlr` you need a resample description for that, see the [quickstart](https://mlr-org.github.io/mlr-tutorial/devel/html). – Lars Kotthoff Feb 01 '16 at 17:22
  • Not sure what you meant by "train on a subset of the data only"; are you referring to the CV in the `trControl`? If so, I actually compared with `mlr`'s `resample` and it hung, which is why I then tried just `train`. For completeness: I had `rdesc <- makeResampleDesc("RepCV", reps=3, folds=5)`, which I believe is the same as the `fitControl` I have for `caret`. For `rpart` I used `resample("classif.rpart", pred.bin.task, rdesc)`, and the screen is stuck at `[Resample] repeated cross-validation iter: 1`. Compared to that, `resample("classif.lda", pred.bin.task, rdesc)` completes all 15 iterations in less than 10 seconds. – Ricky Feb 02 '16 at 02:54
  • I see. In any case, it seems that `rpart` doesn't like particular parts of the data (and you would probably get the same problem with `caret` if you took a different random seed, a different partitioning, or something like that). – Lars Kotthoff Feb 02 '16 at 02:56