How to deal with a memory issue in ctree in the party package?

Question

I am using the ctree method of the party R package to generate a decision tree.

My dataset has about 22 columns and 650000 rows of data. I allocated 10GB of memory to my R session using the memory.limit command.

I have a 2.3 GHz i3 processor and 6GB of RAM. What am I doing wrong here?

The error I get is:

Calloc could not allocate memory (6223507 of 8 bytes)

If you only have 6GB of RAM, allocating 10GB won't help in any way. Also, memory allocation problems in `ctree()` are usually caused by `factor` variables with too many unique levels. You have two solutions: either bucket them and hence reduce the number of unique levels, or weight every unique path and use `weights` in `ctree()` — David Arenburg, Mar 25 '14 at 14:30
I have modified my dataset to include columns with at most 5 factor levels. I still have about 20 columns and 6 million rows. Can my computer do this job? — apTNow, Mar 26 '14 at 22:42
Give it a try. What is your explained variable, btw? Is it binary or continuous? Because if it's binary, you can weight the tree per unique path and significantly reduce its size — David Arenburg, Mar 26 '14 at 23:00
It's a character vector with 3 levels. Still running out of memory. — apTNow, Mar 31 '14 at 09:24
I don't think `ctree()` can receive a character vector as the explained variable. Only numeric, integer or factor — David Arenburg, Mar 31 '14 at 09:27
When I used str(myclass), it showed it as a factor with 3 levels. — apTNow, Mar 31 '14 at 09:33
So it's a factor then. OK, let me write you some code to weight your data and significantly reduce its size. You should come here more often, btw, to check for answers — David Arenburg, Mar 31 '14 at 09:34
Sure, I will. I cannot thank you enough for your help. Actually I am in college right now. — apTNow, Apr 01 '14 at 11:20
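
The bucketing idea from the first comment can be sketched roughly as follows. This is only an illustration in base R: the helper name `bucket_levels`, the `min_count` threshold and the "Other" label are assumptions made for the example, not something taken from the thread or from the party package.

```r
# Hypothetical helper: lump every level that occurs fewer than `min_count`
# times into a single catch-all level, reducing the number of unique levels.
bucket_levels <- function(f, min_count = 100, other = "Other") {
  f <- as.factor(f)
  counts <- table(f)
  rare <- names(counts)[counts < min_count]

  # Assigning the same name to several levels merges them into one level
  lv <- levels(f)
  lv[lv %in% rare] <- other
  levels(f) <- lv
  f
}

# Toy example: "c" and "d" are rare and get merged
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))
g <- bucket_levels(f, min_count = 3)
table(g)
```

The threshold would have to be chosen per column; a value that is too high may merge levels that carry real signal.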

David Arenburg · Answer 1 · 2014-04-05T23:36:17.587

OK, I finally found some time to do this. It's not too elegant, but it should work. First, load the libraries and the function below (you'll need to install the data.table package):

library(data.table)
library(party)

WeightFunc <- function(data, DV){
  # Helper that pastes the values of one row into a single string (its "unique path")
  paste2 <- function(x) paste(x, collapse = ",")
  ignore <- DV

  # Creating unique paths from all columns except the explained variable
  test3 <- apply(data[setdiff(names(data), ignore)], 1, paste2)

  # Binding the unique paths vector back to the original data
  data <- cbind(data, test3)

  # Getting the values of each explanatory variable for each unique path
  dt <- data.table(data[setdiff(names(data), ignore)])
  dt.out <- as.data.frame(dt[, head(.SD, 1), by = test3])

  # Creating a dummy variable for each value of the explained variable for further calculations
  DVLvs <- as.character(unique(data[, DV]))
  data[, DVLvs[1]] <- ifelse(data[, DV] == DVLvs[1], 1, 0)
  data[, DVLvs[2]] <- ifelse(data[, DV] == DVLvs[2], 1, 0)
  data[, DVLvs[3]] <- ifelse(data[, DV] == DVLvs[3], 1, 0)

  # Summing dummy variables per unique path
  dt <- data.table(data[c("test3", DVLvs)])
  dt.out2 <- as.data.frame(dt[, lapply(.SD, sum), by = test3])

  # Binding unique paths with sums
  dt.out2$test3 <- dt.out$test3 <- NULL
  test <- cbind(dt.out, dt.out2)

  # Triplicating the data in order to create a weight for every level of the explained variable
  test2 <- test[rep(1:nrow(test), each = 3), ]
  # AdjDV must be a factor for ctree(); since R 4.0.0 strings are no longer
  # converted to factors automatically, so convert explicitly
  test2 <- cbind(test2, AdjDV = factor(DVLvs, levels = DVLvs))
  # Row names of the copies end in ".1" and ".2"; the original row has no suffix
  test2$Weights <- ifelse(is.element(seq_len(nrow(test2)), grep("[.]1", rownames(test2))), test2[, DVLvs[2]],
                          ifelse(is.element(seq_len(nrow(test2)), grep("[.]2", rownames(test2))), test2[, DVLvs[3]], test2[, DVLvs[1]]))

  # Deleting unnecessary columns
  test2[, DVLvs[1]] <- test2[, DVLvs[2]] <- test2[, DVLvs[3]] <- NULL

  return(test2)
}

Now run this function on your data set, where data is your data and DV is the name of your explained variable (in quotes), and save the result in a new data set, for example:

Newdata <- WeightFunc(data = Mydata, DV = "Success")

This process could take a while if you have many unique paths, but it shouldn't overload your memory. If you don't have too many unique paths, this function should shrink your data set by a factor of tens or even hundreds. Note that this function only works for an explained variable that is a 3-level factor (like the one you have).
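
For an explained variable with a different number of levels, the same weighting idea can be sketched with a plain data.table group count. This is a different, simpler technique than the function above, not a rewrite of it; the toy data and the names `toy`, `agg`, `x1`, `x2` and `y` are invented for illustration.

```r
library(data.table)

# Hypothetical toy data: two categorical predictors and a 4-level response
set.seed(1)
n <- 10000
toy <- data.table(
  x1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x2 = factor(sample(c("u", "v"), n, replace = TRUE)),
  y  = factor(sample(c("w", "x", "y", "z"), n, replace = TRUE))
)

# One row per observed combination of predictors and response;
# Weights counts how many original rows share that combination
agg <- toy[, .(Weights = .N), by = .(x1, x2, y)]

nrow(agg)         # at most 3 * 2 * 4 = 24 rows instead of 10000
sum(agg$Weights)  # every original row is counted exactly once
```

Because only combinations that actually occur are kept, this approach produces no zero-weight rows, and the counts are integers, so they can be used as case weights.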

After that, you can run ctree as you did previously, but with the new data, the new explained variable (which is called AdjDV) and the weights parameter set to the Weights column. You'll also have to exclude Weights from the data set when running ctree, like this:

ct <- ctree(AdjDV ~., data = Newdata[setdiff(names(Newdata), "Weights")], weights = Newdata$Weights)
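
Before fitting, it may be worth sanity-checking the weighted data. The sketch below uses a small hand-made stand-in for Newdata (all values are invented for illustration): the weights should sum to the number of rows in the original data, and rows with a weight of zero, which appear whenever a path never occurs together with a given level, can probably be dropped, since they should not contribute to the fit.

```r
# Hypothetical stand-in for the output of WeightFunc(): two unique paths,
# three rows each (one per level of the explained variable)
Newdata <- data.frame(
  x1      = factor(rep(c("a", "b"), each = 3)),
  AdjDV   = factor(rep(c("low", "mid", "high"), times = 2)),
  Weights = c(4, 0, 1, 2, 3, 0)
)
n_original <- 10  # assumed number of rows in the original data

# Every original row should be counted exactly once
stopifnot(sum(Newdata$Weights) == n_original)

# Drop path/level combinations that never occurred in the original data
Newdata <- Newdata[Newdata$Weights > 0, ]
nrow(Newdata)
```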
