
I have a 2.2 million row dataset. RandomForest throws an error if the training data set has more than 1,000,000 rows. So I split the data set into two pieces and train the models separately. How do I combine() the models so I can make a prediction with both of their knowledge?

rtask <- makeClassifTask(data=Originaldaten,target="geklaut")
set.seed(1)

ho = makeResampleInstance("CV", task = rtask, iters = 20)
rtask.train = subsetTask(rtask, subset = 1:1000000)
rtask.train2 = subsetTask(rtask, subset = 1000001:2000000)
rtask.test = subsetTask(rtask, subset = 2000001:2227502)

rlearn_lm <- makeWeightedClassesWrapper(makeLearner("classif.randomForest"), wcw.weight = 0.1209123724417812)

param_lm <- makeParamSet(
  makeIntegerParam("ntree", 500, 500),
  makeLogicalParam("norm.votes", FALSE, FALSE),
  makeLogicalParam("importance", TRUE, TRUE),
  makeIntegerParam("maxnodes" ,4,4)
)

tune_lm <- tuneParams(rlearn_lm,
                      rtask.train,
                      cv5,  # 5-fold cross-validation
                      mmce, # misclassification error
                      param_lm,
                      makeTuneControlGrid(resolution = 5)) # value ranges of the grid

rlearn_lm <- setHyperPars(rlearn_lm, par.vals = tune_lm$x)

model_lm <- train(rlearn_lm, rtask.train)
model_lm2 <- train(rlearn_lm, rtask.train2)
modelGesamt <- combine(model_lm, model_lm2)

EDIT

You guys are right. Actually reading my own code helped me a lot. Here is a working resampling setup for anyone interested in the future:

ho = makeResampleInstance("CV", task = rtask, iters = 20)
rtask.train = subsetTask(rtask, ho$train.inds[[1]])
rtask.test = subsetTask(rtask, ho$test.inds[[1]])
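
For completeness, here is a sketch (not part of the original post) of how a model trained on this holdout split could then be evaluated, reusing the learner `rlearn_lm` defined above:

model <- train(rlearn_lm, rtask.train)
pred  <- predict(model, rtask.test)
performance(pred, measures = mmce)  # misclassification error on the held-out fold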
  • The bigRF package may solve your problem. – Mohamed Desouky Jun 14 '22 at 16:22
  • I suggest that instead of trying to train 2M rows or splitting the data by row number, you try using a random partition of the data (a sketch of this follows below). How many variables do you have? How accurate is your model with one sample? If you have many variables, it would probably be better to find which variables influence the model the most and work towards optimization that way. – Kat Jun 14 '22 at 18:13
  • @JonSpring Ensemble models don't help here, you would still combine models that were trained on different datasets and hence have no relation to each other. Why would you merge such? FWIW and given that OP asked for mlr, here's the mlr3 solution for ensemble models: https://mlr3book.mlr-org.com/05-pipelines-non-linear.html#pipe-model-ensembles. – pat-s Jun 14 '22 at 20:17
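
Following up on Kat's suggestion, a random partition can be drawn instead of slicing the data by row number. This is only a sketch built on the question's objects (Originaldaten, rtask); the index names are made up for illustration:

set.seed(1)
n <- nrow(Originaldaten)

# draw 1,000,000 training rows at random instead of taking rows 1:1000000
train.idx <- sample(n, 1e6)
test.idx  <- setdiff(seq_len(n), train.idx)

rtask.train <- subsetTask(rtask, subset = train.idx)
rtask.test  <- subsetTask(rtask, subset = test.idx)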

1 Answer


This is not possible and you should also not do it. Train one model, even if it takes longer.

Models can't be merged to fuse their knowledge if they were trained on different datasets.

pat-s
  • One possibility would be to put both models in an ensemble and then take the majority of the two predictions, or stack another learner on top; `mlr` supports both of these options (a stacking sketch follows below). – Lars Kotthoff Jun 18 '22 at 14:07
  • How would you take the majority of n=2? And these two learners would still be trained on different datasets, hence the outcome cannot be relied on, can it? – pat-s Jun 20 '22 at 06:32
  • For two base learners you'd probably want to do stacking. And I'm assuming that the datasets would come from the same underlying distribution, i.e. the usual i.i.d. assumption. – Lars Kotthoff Jun 20 '22 at 10:41
  • I think it's still better to say "don't do that", as otherwise it will lead people into very dangerous territory: most people thinking along these lines often don't understand what happens behind the scenes or how to interpret and handle the results. – pat-s Jun 21 '22 at 06:29
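
To illustrate the stacking option Lars Kotthoff mentions, here is a hedged sketch using mlr's makeStackedLearner. The base learners and the "stack.cv" method are assumptions for illustration, not something the answer prescribes. Note that both base learners are fit on the same task: this builds an ensemble, it does not merge two models that were already trained on disjoint halves, which is exactly the limitation the answer points out.

library(mlr)

# two base learners (illustrative choice); in the question both would be
# the weighted random forest wrapper
base.learners <- list(
  makeLearner("classif.randomForest", id = "rf1"),
  makeLearner("classif.randomForest", id = "rf2")
)

# a simple logistic regression is stacked on top of the base learners'
# cross-validated predictions
stacked <- makeStackedLearner(
  base.learners = base.learners,
  super.learner = "classif.logreg",
  method        = "stack.cv"
)

model <- train(stacked, rtask.train)
pred  <- predict(model, rtask.test)
performance(pred, measures = mmce)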