I have a 2.2 million row dataset. randomForest throws an error when the training set has more than 1,000,000 rows, so I split the data into two pieces and train the models separately. How do I combine() the models so I can make a prediction with both of their knowledge?
rtask <- makeClassifTask(data=Originaldaten,target="geklaut")
set.seed(1)
ho = makeResampleInstance("CV",task=rtask, iters = 20)
rtask.train = subsetTask(rtask, subset = 1:1000000)
rtask.train2 = subsetTask(rtask, subset = 1000001:2000000)
rtask.test = subsetTask(rtask, subset = 2000001:2227502)
rlearn_lm <- makeWeightedClassesWrapper(makeLearner("classif.randomForest"), wcw.weight = 0.1209123724417812)
param_lm <- makeParamSet(
makeIntegerParam("ntree", 500, 500),
makeLogicalParam("norm.votes", FALSE, FALSE),
makeLogicalParam("importance", TRUE, TRUE),
makeIntegerParam("maxnodes" ,4,4)
)
tune_lm <- tuneParams(rlearn_lm,
rtask.train,
cv5, # 5-fold cross-validation
mmce, # misclassification error
param_lm,
makeTuneControlGrid(resolution=5)) # value ranges
rlearn_lm <- setHyperPars(rlearn_lm,par.vals = tune_lm$x)
model_lm <- train(rlearn_lm,rtask.train)
model_lm2 <- train(rlearn_lm,rtask.train2)
modelGesamt <- combine(model_lm, model_lm2)
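For context on what combine() can actually merge: mlr itself has no combine() for trained WrappedModel objects, but the underlying randomForest package does provide randomForest::combine(), which merges the trees of several forests fitted on the same variables. A minimal sketch on iris (standing in for the real data; the shuffled split is my assumption so both halves see all classes):

library(randomForest)

# Train two forests on disjoint halves of the data
set.seed(1)
idx <- sample(nrow(iris))
rf1 <- randomForest(Species ~ ., data = iris[idx[1:75], ], ntree = 250)
rf2 <- randomForest(Species ~ ., data = iris[idx[76:150], ], ntree = 250)

# combine() merges the trees into a single 500-tree forest
rf.all <- combine(rf1, rf2)

# Predict with the merged forest
pred <- predict(rf.all, newdata = iris)

Caveat: the combined object loses its out-of-bag statistics (err.rate, confusion, etc. are NULL), so evaluate it on held-out data. To get the raw randomForest objects out of mlr models, getLearnerModel(model_lm, more.unwrap = TRUE) should unwrap the weighted-classes wrapper, though I have not verified this for every wrapper combination.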
EDIT
You guys are right. Actually reading my own code helped a lot. Here is a working resampling setup for anyone interested in the future:
ho = makeResampleInstance("CV",task=rtask, iters = 20)
rtask.train = subsetTask(rtask, ho$train.inds[[1]])
rtask.test = subsetTask(rtask, ho$test.inds[[1]])
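With the resample instance in place, training on the first fold and evaluating on its held-out part uses the standard mlr calls. A short sketch, assuming the rlearn_lm learner defined above:

# Train on the first CV fold and evaluate on its test fold
model <- train(rlearn_lm, rtask.train)
pred  <- predict(model, rtask.test)
performance(pred, measures = mmce)  # misclassification error on held-out data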