I created several ctree models (about 40 to 80) which I want to evaluate rather often.
One issue is that the model objects are very big (40 models require more than 2.8 GB of memory), and it appears to me that they store the training data, maybe as modelname@data and modelname@responses, and not just the information relevant for predicting new data.
Most other R learning packages have configurable options for whether to include the data in the model object, but I couldn't find any hints in the documentation. I also tried to assign empty ModelEnv objects with
modelname@data <- new("ModelEnv")
but there was no effect on the size of the respective RData file.
Does anyone know whether ctree really stores the training data, and how to remove all data from ctree models that is irrelevant for new predictions, so that I can fit many of them in memory?
Thanks a lot,
Stefan
Thank you for your feedback, that was already very helpful.
I used dput and str to take a deeper look at the object and found that no training data is included in the model, but there is a responses slot, which seems to hold the training labels and row names. Anyway, I noticed that each node has a weight vector for each training sample. After a while of inspecting the code, I ended up googling a bit and found the following comment in the party NEWS log:
CHANGES IN party VERSION 0.9-13 (2007-07-23)
o update `mvt.f'
o improve the memory footprint of RandomForest objects
substancially (by removing the weights slots from each node).
It turns out there is a C function in the party package to remove these weights, called R_remove_weights, with the following definition:
SEXP R_remove_weights(SEXP subtree, SEXP removestats) {
C_remove_weights(subtree, LOGICAL(removestats)[0]);
return(R_NilValue);
}
It also works fine:
# cc is my model object
sum(unlist(lapply(slotNames(cc), function (x) object.size(slot(cc, x)))))
# returns: [1] 2521256
save(cc, file="cc_before.RData")
.Call("R_remove_weights", cc@tree, TRUE, PACKAGE="party")
# returns NULL and removes weights and node statistics
sum(unlist(lapply(slotNames(cc), function (x) object.size(slot(cc, x)))))
# returns: [1] 1521392
save(cc, file="cc_after.RData")
As you can see, it reduces the object size substantially, from roughly 2.5MB to 1.5MB.
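To see which slots account for the size, a per-slot breakdown can be more informative than the sum. The following is a minimal sketch in base R; `slot_sizes` and the `ToyModel` class are hypothetical names used only for illustration and are not part of party. On a real model the call would be `slot_sizes(cc)`.

```r
library(methods)  # loaded explicitly so the sketch also runs under Rscript

# Hypothetical helper: size in bytes of every slot of an S4 object,
# returned as a named numeric vector.
slot_sizes <- function(obj) {
  vapply(slotNames(obj),
         function(s) as.numeric(object.size(slot(obj, s))),
         numeric(1))
}

# Toy S4 class standing in for a fitted model (an assumption, not party's class).
setClass("ToyModel", representation(tree = "list", responses = "numeric"))
toy <- new("ToyModel",
           tree = list(weights = runif(1000)),
           responses = runif(10))

sizes <- slot_sizes(toy)
print(sort(sizes, decreasing = TRUE))  # largest slots first
```

Running it before and after the weights are removed shows which slot actually shrinks.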
What is strange, though, is that the corresponding RData files are insanely huge, and there is no impact on them:
$ ls -lh cc*
-rw-r--r-- 1 user user 9.6M Aug 24 15:44 cc_after.RData
-rw-r--r-- 1 user user 9.6M Aug 24 15:43 cc_before.RData
Unzipping the file shows the 2.5MB object to occupy nearly 100MB of space:
$ cp cc_before.RData cc_before.gz
$ gunzip cc_before.gz
$ ls -lh cc_before*
-rw-r--r-- 1 user user 98M Aug 24 15:45 cc_before
Any ideas what could cause this?
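One possibility, which I have not verified against party itself, is that object.size() does not include environments reachable from an object, whereas save() serializes them. The base-R sketch below uses a hypothetical closure (`make_model`) as a stand-in for a model to show how far the two numbers can diverge. The analogous check on a real model would be to compare `object.size(cc)` with `length(serialize(cc, NULL))`.

```r
# Hypothetical stand-in for a model object: a closure whose enclosing
# environment holds a large vector. object.size() does not count that
# environment, but serialize() (which save() relies on) writes it out.
make_model <- function() {
  big <- runif(1e6)    # roughly 8 MB kept alive in the closure's environment
  function(x) x + 1    # the returned "model" itself looks tiny
}
m <- make_model()

in_memory  <- as.numeric(object.size(m))   # bytes reported by object.size()
serialized <- length(serialize(m, NULL))   # bytes in the uncompressed serialization

cat("object.size:", in_memory, "bytes\n")
cat("serialized: ", serialized, "bytes\n")
```

If the serialized length of a model is much larger than its object.size(), the extra bytes live in something object.size() skips, such as an environment or a function slot.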