I have a data.table that is getting pretty large (>1 TB) and no longer fits in my server's RAM (~1 TB).

The table has person and family identifiers and a large number (~120) of logical indicator variables. The data is used to generate reports and is aggregated with data.table and dplyr functions.

For instance:

library(dplyr)
library(data.table)
n_obs <- 100000 # just for a manageable example; the actual dataset has ~1.5e+09 rows
n_vl <- 120
names_var <- paste0("var_", 1:n_vl)
d <- data.table(
  person_id = 1:n_obs,
  family_id = (1:n_obs) %/% 2L # integer division, i.e. floor(person_id / 2)
)
# add the logical indicator columns by reference (no copy of d)
for (v in names_var) {
  set(d, j = v, value = sample(c(TRUE, FALSE), n_obs, replace = TRUE))
}

# Aggregating

by_vars <- c('family_id', paste0("var_", 1:5) )
sum_vars <- paste0("var_", 6:n_vl)

d_agg <- d[, lapply(.SD, sum, na.rm = TRUE),
           by = by_vars,
           .SDcols = sum_vars]

Before exploring out-of-memory solutions, I would like to reduce the memory footprint of the data. In particular, each logical value takes 4 bytes in R. Maybe some other variable type (bit64 package) could help.
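The 4-bytes-per-value figure can be checked with base R's `object.size()`. The following is a minimal sketch, not a recommendation: it compares a logical vector with a `raw` vector (1 byte per element) of the same length. The variable names are illustrative, and the small fixed header overhead may differ between R versions.

```r
n <- 1e6
flags_lgl <- sample(c(TRUE, FALSE), n, replace = TRUE)
flags_raw <- as.raw(flags_lgl)  # 1 byte per element, but raw cannot represent NA

size_lgl <- as.numeric(object.size(flags_lgl))
size_raw <- as.numeric(object.size(flags_raw))

print(size_lgl / n)  # roughly 4 bytes per element
print(size_raw / n)  # roughly 1 byte per element

# sum() is not defined for raw vectors, so convert back before aggregating
stopifnot(sum(as.logical(flags_raw)) == sum(flags_lgl))
```

Note that `raw` only gets a 4x reduction and loses `NA`, so it is a partial answer at best.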

What is the state of the art on this?

Is there any other tradeoff (memory use vs speed) in using bit (or some other form of compact boolean)?

Is the bit format supported by data.table and dplyr functions? Can it be summed, aggregated, etc?
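One way to reason about the trade-off is to pack the flags by hand. The sketch below is a hand-rolled illustration in base R, not a data.table or dplyr feature: `pack_flags()` and `unpack_flag()` are hypothetical helper names, and the code assumes the flags contain no `NA`. It stores up to 31 flags in one integer column, so 120 flags would need 4 integer columns (16 bytes per row) instead of 120 logical columns (480 bytes per row).

```r
# Pack up to 31 logical vectors into one integer vector.
# Flag k is stored in the bit with value 2^(k - 1). Assumes no NA values.
pack_flags <- function(flag_list) {
  stopifnot(length(flag_list) >= 1L, length(flag_list) <= 31L)
  packed <- integer(length(flag_list[[1L]]))
  for (k in seq_along(flag_list)) {
    bit_value <- as.integer(2^(k - 1))
    packed <- packed + as.integer(flag_list[[k]]) * bit_value
  }
  packed
}

# Recover flag k as a logical vector by masking its bit.
unpack_flag <- function(packed, k) {
  bitwAnd(packed, as.integer(2^(k - 1))) != 0L
}

# Usage: 8 random flags, packed and then summed one at a time
set.seed(1)
n <- 1000L
flags <- replicate(8, sample(c(TRUE, FALSE), n, replace = TRUE), simplify = FALSE)
packed <- pack_flags(flags)
stopifnot(identical(unpack_flag(packed, 3L), flags[[3L]]))
print(sum(unpack_flag(packed, 3L)) == sum(flags[[3L]]))
```

The cost is on the speed side: every sum or grouping on a flag first has to unpack it, which takes CPU time and allocates a temporary logical vector of full length.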

Edit 1: added code to simulate the data. Edit 2: added code to simulate the aggregation process.

LucasMation
  • Side note: are there any character columns in your data? If so, try to convert them to `factor`. – talat Jan 27 '17 at 15:25
  • If the logical identifiers are putting entries into mutually exclusive categories, you might save space by converting to a single categorical var (a factor as docendo suggested). Also, if you have duplicated rows (with the exact same values in all columns), you could collapse the data to counts of rows and probably save quite a bit of space. It might help if you give a more concrete illustration (like some code that creates similar-enough data as a function of `n`, the number of rows, or some other scale parameters). – Frank Jan 27 '17 at 15:37
  • @docendodiscimus: no characters. Only numeric identifiers and logicals – LucasMation Jan 27 '17 at 15:56
  • @Frank: most categories are mutually exclusive (I'll double-check), there are no duplicates (already cleaned that up) – LucasMation Jan 27 '17 at 15:56
  • @docendodiscimus According to [Joshua Ulrich's answer](http://stackoverflow.com/a/13570765/3817004) _Converting to factor won't save space because characters are stored in a hash table._ – Uwe Jan 27 '17 at 16:36
  • @Uwe I may be wrong, but when I compare character and factor vectors like the ones from Josh O'Brien's answer to the linked question, it does look as if the factor vector is about half the size. – talat Jan 27 '17 at 16:58
  • @Frank: sorry, I meant to say the dummies are NOT mutually exclusive – LucasMation Jan 27 '17 at 18:17
  • Ok. Anyway, we voted to close the question until you can give a reproducible example to more specifically illustrate the problem. Besides the data, it may also help to know what sort of aggregation you need to do with it. – Frank Jan 27 '17 at 18:27
  • @Frank: just added a random dataset to illustrate. I will add agg functions later – LucasMation Jan 27 '17 at 20:22
  • @docendodiscimus It seems you are right. I've repeated the tests with R 3.3.2 for x86_64 (Windows 10 64-bit) and indeed, character vectors need twice as much RAM as factor vectors. That's quite interesting because even Hadley Wickham wrote in his [_Advanced R_ book](http://adv-r.had.co.nz/Data-structures.html#attributes): _In early versions of R, there was a memory advantage to using factors instead of character vectors, but this is no longer the case._ – Uwe Jan 28 '17 at 00:18
  • How many of these indicators do you need to _store_, as opposed to create on the fly? For example, if you have, say, a gender flag and a gender column, you can remove the gender flag and create it on-the-fly whenever you need it... – MichaelChirico Jan 28 '17 at 19:18
  • @MichaelChirico, all of them. I am not storing any variables that are redundant with the flags – LucasMation Jan 31 '17 at 12:50
  • Just added the aggregation code, as suggested – LucasMation Feb 01 '17 at 12:41
  • related question: http://stackoverflow.com/questions/17718326/save-storage-space-for-small-integers-or-factors-with-few-levels – eddi Feb 08 '17 at 20:17

0 Answers