I have a data.table that has grown quite large (>1 TB) and is starting not to fit in my server's RAM (~1 TB).
The table contains person and family identifiers plus a large number of logical indicator variables (~120). The data is used to generate reports and is aggregated with data.table and dplyr functions.
For instance:
library(dplyr)
library(data.table)

n_obs <- 100000  # just for a manageable example; the actual dataset has ~1.5e+09 rows
n_vl  <- 120
names_var <- paste0("var_", 1:n_vl)

d <- data.table(
  person_id = 1:n_obs,
  family_id = floor((1:n_obs) / 2)  # two persons per family
)

# fill the ~120 logical indicator columns
for (i in seq_along(names_var)) {
  set(d, j = names_var[i], value = sample(c(TRUE, FALSE), n_obs, replace = TRUE))
}
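# quick check of the simulated table's footprint: the ~120 logical columns
# at 4 bytes per element account for almost all of it
print(object.size(d), units = "Mb")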
# Aggregating: sum the remaining indicators by family and by the first five indicators
by_vars  <- c("family_id", paste0("var_", 1:5))
sum_vars <- paste0("var_", 6:n_vl)

d_agg <- d[, lapply(.SD, sum, na.rm = TRUE),
           by = by_vars,
           .SDcols = sum_vars]
Before exploring out-of-memory solutions, I would like to reduce the memory footprint of the data. In particular, logical vectors take 4 bytes per element in R, so the ~120 indicator columns account for nearly all of the table's size. Maybe some other variable type (e.g. the bit or bit64 packages) could help.
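To make the comparison concrete, this is the kind of check I have in mind (a rough sketch assuming the bit package, which packs 32 booleans per integer word):
library(bit)
n <- 1e6
x_logical <- sample(c(TRUE, FALSE), n, replace = TRUE)
x_bit <- as.bit(x_logical)  # packed representation from the bit package
object.size(x_logical)      # ~4 MB: 4 bytes per element
object.size(x_bit)          # ~125 KB: roughly 1 bit per element, plus some overhead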
What is the state of the art on this?
Is there a tradeoff (memory use vs. speed) in using bit (or some other form of compact boolean)?
Is the bit format supported by data.table and dplyr functions? Can it be summed, aggregated, etc.?
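For reference, this is the kind of behaviour I would need to keep (a minimal standalone sketch, assuming sum() and as.logical() dispatch to the bit package's methods; whether the same holds for a bit column grouped inside a data.table is exactly what I am asking):
library(bit)
b <- as.bit(sample(c(TRUE, FALSE), 10, replace = TRUE))
sum(b)         # counts the TRUEs, as with a plain logical vector
as.logical(b)  # round-trip back to an ordinary logical vector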
Edit 1: added code to simulate the data. Edit 2: added code to simulate the aggregation process.