
I read a large .csv file into a data.table using fread. In the next step, I perform some data cleaning operations, such as converting character variables to dates.
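
For context, the read step looks roughly like this (the file name is just a placeholder, not my actual path):

library(data.table)

# hypothetical file name; the real file has ~8.5 million rows and ~30 columns that need manipulation
dt <- fread("large_file.csv")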

Memory optimized solution

Here is my memory optimized solution as an MWE:

library(data.table)
library(lubridate)  # for ymd()
library(foreach)

dt <- data.table(chr_date_1 = c("2017-01-01", "2017-01-02", "2017-01-03"),
                 chr_date_2 = c("2017-01-04", "2017-01-05", "2017-01-06"))

for (date_var in c("chr_date_1", "chr_date_2")) {
  set(dt, j = date_var, value = ymd(dt[[date_var]]))
}

Speed optimized solution?

Since there are 8.5 million observations and ~30 variables that need to be manipulated in some way, I was wondering whether this could be sped up using foreach. (I run R on a virtual client with 128 GB RAM, so memory is not the primary concern.) My first attempt looked like this:

library(doParallel)  # parallel backend for foreach
registerDoParallel(cores = 7)

foreach(date_var = names(dt), .packages = c("data.table")) %dopar% {
  set(dt, j = date_var, value = as.Date(dt[[date_var]]))
}

Unfortunately, this only returns (and prints) a list of results instead of updating dt by reference:

[[1]]
   chr_date_1 chr_date_2
1: 2017-01-01 2017-01-04
2: 2017-01-02 2017-01-05
3: 2017-01-03 2017-01-06

[[2]]
   chr_date_1 chr_date_2
1: 2017-01-01 2017-01-04
2: 2017-01-02 2017-01-05
3: 2017-01-03 2017-01-06

Next, I tried to set up .combine and .init, ...

foreach(date_var = names(dt), .packages = c("data.table"),
        .combine = "cbind", .init = dt) %dopar% {
  set(dt, j = date_var, value = as.Date(dt[[date_var]]))
}

... but this appended new columns instead of updating the existing ones:

   chr_date_1 chr_date_2 chr_date_1 chr_date_2 chr_date_1 chr_date_2
1: 2017-01-01 2017-01-04 2017-01-01 2017-01-04 2017-01-01 2017-01-04
2: 2017-01-02 2017-01-05 2017-01-02 2017-01-05 2017-01-02 2017-01-05
3: 2017-01-03 2017-01-06 2017-01-03 2017-01-06 2017-01-03 2017-01-06

So, is it possible at all to update several columns of a data.table in parallel using foreach?

If not, might there be a solution in which I create all the new columns in parallel and then somehow merge them into the existing data.table? One aspect that kept me from pursuing this is that the last foreach snippet yields the same result (4 columns) when I omit .init = dt.
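
To make that idea concrete, here is a rough sketch of what I have in mind, starting again from the character-column version of dt: each worker returns one converted vector, and the assignment by reference happens afterwards in the main session.

library(data.table)
library(doParallel)

registerDoParallel(cores = 7)

date_vars <- c("chr_date_1", "chr_date_2")

# step 1: parse the columns in parallel; each worker returns a Date vector
converted <- foreach(date_var = date_vars, .packages = "data.table") %dopar% {
  as.Date(dt[[date_var]], format = "%Y-%m-%d")
}

# step 2: assign the results by reference in the main session
for (i in seq_along(date_vars)) {
  set(dt, j = date_vars[i], value = converted[[i]])
}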


EDIT:

I found a way of getting the desired result, but in an example with 7 variables (= the number of registered cores) and 3 million observations it took roughly 5 times longer than the memory optimized solution. So, just to show how not to do it:

library(magrittr)

result <- foreach(date_var = names(dt), .packages = c("data.table")) %dopar% {
  dt[, (date_var) := lapply(.SD, as.Date, format = "%Y-%m-%d"), .SDcols = date_var]
} %>% .[[length(.)]]

Also, the amount of memory needed is insane.

  • No, that's not possible. However, depending on your date format there are faster date parsers available in packages. – Roland Sep 27 '17 at 10:09
  • https://stackoverflow.com/questions/35247063/is-there-a-fast-parser-for-date – Roland Sep 27 '17 at 10:11
  • Thanks for pointing that out, @Roland. I was actually using `lubridate` due to convenience but replaced it in the MWE for simplicity. Your post made me aware of the fact that `ymd` is even (way!) slower than `as.Date`. – der_grund Sep 27 '17 at 11:35
  • You are just parallelizing by columns? I imagine there's some `DT[, (cols) := mclapply(.SD, as.Date), .SDcols=cols]` or similar. – Frank Sep 27 '17 at 16:34
  • Unfortunately, I run R on Windows here at work, so I can't increase `mc.cores`. Therefore, `mclapply` is slightly slower than the _memory optimized solution_ I described in my question. Anyway, thanks for the hint, I will try that at home on a linux machine. – der_grund Sep 28 '17 at 07:39
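
Following up on the comments, here is a minimal sketch of the single-threaded route with a faster parser. It assumes the fasttime package is installed and that the strings are in ISO format; fastPOSIXct parses in C, and the result is then truncated to a Date.

library(data.table)
library(fasttime)

date_vars <- c("chr_date_1", "chr_date_2")

for (date_var in date_vars) {
  # fastPOSIXct() parses ISO-like date strings in C, assuming GMT;
  # as.Date() then drops the (zero) time component
  set(dt, j = date_var, value = as.Date(fastPOSIXct(dt[[date_var]]), tz = "GMT"))
}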
