How to combine factor levels from two empty data.frames?
I have a big data set split into separate files. I need a data.frame that has all possible levels for its factor columns, but I can't load all the parts at once, only one part at a time.
Is there a way to do something like:
data_structure = NULL
for (chunk_i in chunks) {
  # load chunk_i data into data_i
  if (is.null(data_structure)) {
    data_structure = data_i
  } else {
    # at this line the factor levels are NOT combined as I expect;
    # instead only the factor levels from 'data_i' end up in 'data_structure'
    data_structure = rbind(data_structure, data_i)
  }
  rm(data_i)
  # empty the data frame, since I can't keep all the data in memory;
  # I want to keep only metadata, like factor levels
  data_structure = data_structure[0, ]
}
I need this data_structure later to convert factors to binary columns, like this:
result_i = model.matrix(~ . + 0, data = data_i,
  contrasts.arg = lapply(data_structure, contrasts, contrasts = FALSE))
If the factor levels are gathered from all parts of the data, then I can be sure that result_i will have exactly the same binary columns as every other part of the data, even if this particular data_i has fewer factor levels in some columns.
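To illustrate the point about identical columns, here is a minimal sketch. The column names `x` and `color` and the level set `all_colors` are made up for the example; it assumes the full level set is already known and that the chunk's factor was declared with it.

```r
# Hypothetical full level set, assumed to have been gathered from all chunks
all_colors <- c("blue", "green", "red")

# A toy chunk in which only two of the three levels actually occur
data_i <- data.frame(
  x     = c(1.5, 2.5),
  color = factor(c("red", "blue"), levels = all_colors)
)

# contrasts = FALSE yields one indicator column per declared level
result_i <- model.matrix(
  ~ . + 0,
  data = data_i,
  contrasts.arg = list(color = contrasts(data_i$color, contrasts = FALSE))
)

print(colnames(result_i))
```

The level "green" never occurs in this chunk, yet it still gets its own (all-zero) column, so every chunk processed this way should end up with the same column layout.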
UPDATE
Right now I use this solution:
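all_levels = list()
for_each_chunk(function(data) {
  # levels() returns NULL for non-factor columns; lapply (unlike sapply)
  # always returns a named list, even when all columns have equally many levels
  data_levels = Filter(Negate(is.null), lapply(data, levels))
  factor_names = unique(c(names(all_levels), names(data_levels)))
  # merge this chunk's levels into the running list kept outside the function
  for (name in factor_names) {
    all_levels[[name]] <<- unique(c(all_levels[[name]], data_levels[[name]]))
  }
})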
It's not very elegant in my opinion, but I haven't found anything better yet.
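For what it's worth, the same idea can be sketched without `<<-` by passing the running list of levels through a function. The helpers `merge_levels` and `apply_levels` and the toy chunks below are hypothetical names invented for this sketch; in practice each chunk would be loaded from its file in turn.

```r
# Hypothetical helper: merge the factor levels of one chunk into a running list
merge_levels <- function(all_levels, data) {
  data_levels <- lapply(Filter(is.factor, data), levels)
  for (name in names(data_levels)) {
    all_levels[[name]] <- union(all_levels[[name]], data_levels[[name]])
  }
  all_levels
}

# Hypothetical helper: re-declare each factor column with the full level set
apply_levels <- function(data, all_levels) {
  for (name in intersect(names(all_levels), names(data))) {
    data[[name]] <- factor(data[[name]], levels = all_levels[[name]])
  }
  data
}

# Two toy chunks standing in for the separate files
chunk_1 <- data.frame(color = factor(c("red", "blue")),
                      size  = factor(c("S", "M")))
chunk_2 <- data.frame(color = factor(c("green", "red")),
                      size  = factor(c("M", "L")))

# First pass: collect the levels chunk by chunk, starting from an empty list
all_levels <- Reduce(merge_levels, list(chunk_1, chunk_2), list())

# Second pass: give a chunk the full level set before calling model.matrix
chunk_2_full <- apply_levels(chunk_2, all_levels)
print(levels(chunk_2_full$color))
```

Only the character vectors of levels are kept between chunks, so memory use stays small. The trade-off is that the data has to be read twice: once to collect the levels and once to build the matrices.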