I read a large .csv file into a data.table using fread. As a next step, I conduct some data-cleaning operations, such as converting character variables to dates.
Memory optimized solution
Here is my memory-optimized solution as a MWE:
library(data.table)
library(lubridate)

dt <- data.table(chr_date_1 = c("2017-01-01", "2017-01-02", "2017-01-03"),
                 chr_date_2 = c("2017-01-04", "2017-01-05", "2017-01-06"))

# Update each column by reference, one at a time
for (date_var in c("chr_date_1", "chr_date_2")) {
  set(dt, j = date_var, value = ymd(dt[[date_var]]))
}
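As an aside, the same by-reference update can also be written as a single `:=` statement with `.SDcols`. This is a sketch of an equivalent alternative (using data.table's `as.IDate`, an integer-backed date class, instead of `ymd`):

```r
library(data.table)

dt <- data.table(chr_date_1 = c("2017-01-01", "2017-01-02", "2017-01-03"),
                 chr_date_2 = c("2017-01-04", "2017-01-05", "2017-01-06"))

date_vars <- c("chr_date_1", "chr_date_2")

# Convert all listed columns by reference in one statement
dt[, (date_vars) := lapply(.SD, as.IDate, format = "%Y-%m-%d"), .SDcols = date_vars]
```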
Speed-optimized solution?
Since there are 8.5 million observations and ~30 variables that need to be manipulated in some way, I was wondering whether this could be sped up using foreach. (I run R on a virtual client with 128 GB RAM, so memory is not the primary concern.) My first attempt looked like this:
library(doParallel)

registerDoParallel(cores = 7)
foreach(date_var = names(dt), .packages = c("data.table")) %dopar% {
  set(dt, j = date_var, value = as.Date(dt[[date_var]]))
}
Unfortunately, this only prints the results to the console instead of updating dt:
[[1]]
chr_date_1 chr_date_2
1: 2017-01-01 2017-01-04
2: 2017-01-02 2017-01-05
3: 2017-01-03 2017-01-06
[[2]]
chr_date_1 chr_date_2
1: 2017-01-01 2017-01-04
2: 2017-01-02 2017-01-05
3: 2017-01-03 2017-01-06
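The reason (as I understand it) is that each foreach worker runs in a separate process, so `set()` modifies the worker's own copy of dt; what gets collected and printed is each worker's return value, not the master's table. A minimal illustration of this behavior (hypothetical toy example, not my real data):

```r
library(data.table)
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)

dt <- data.table(x = 1:3)

# Each worker modifies its own copy of dt and returns the modified column
res <- foreach(i = 1:2, .packages = "data.table") %dopar% {
  set(dt, j = "x", value = dt$x * 10L)
  dt$x
}
stopImplicitCluster()

# The master's dt is untouched: dt$x is still 1:3,
# while res holds each worker's modified copy of the column.
```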
Next, I tried setting .combine and .init, ...
foreach (date_var = names(dt), .packages = c("data.table"),
.combine = "cbind", .init = dt) %dopar% {
set(dt, j = date_var, value = as.Date(dt[[date_var]]))
}
... but new columns were added instead of the existing ones being updated:
chr_date_1 chr_date_2 chr_date_1 chr_date_2 chr_date_1 chr_date_2
1: 2017-01-01 2017-01-04 2017-01-01 2017-01-04 2017-01-01 2017-01-04
2: 2017-01-02 2017-01-05 2017-01-02 2017-01-05 2017-01-02 2017-01-05
3: 2017-01-03 2017-01-06 2017-01-03 2017-01-06 2017-01-03 2017-01-06
So, is it possible at all to update several columns of a data.table in parallel using foreach?
If not, might there be a solution in which I create all the new columns in parallel and then somehow merge them into the existing data.table? One aspect that kept me from elaborating on this is the fact that the last foreach snippet yields the same result (4 columns) when I omit .init = dt.
EDIT: I found a way of getting the desired result, but in an example with 7 variables (= the number of registered cores) and 3M observations it took approximately 5 times longer than the memory-optimized solution. So, just to show you how not to do it:
library(magrittr)

result <- foreach(date_var = names(dt), .packages = c("data.table")) %dopar% {
  dt[, (date_var) := lapply(.SD, as.Date, format = "%Y-%m-%d"), .SDcols = date_var]
} %>% .[[length(.)]]  # keep only the last worker's copy of dt
Also, the amount of memory needed is insane.