I am combining the dtplyr and multidplyr libraries to handle some basic mutate/summarise operations on a very large database. After merging, final_db_partition is sometimes 30 million rows long. I cannot figure out whether I am doing something wrong, but either the R session aborts or I run out of memory.
R version 4.0.5 (2021-03-31) / Platform: x86_64-apple-darwin17.0 (64-bit) / Running under: macOS Big Sur 10.16
How should I tackle this issue?
library(multidplyr)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(data.table)
library(stringr)

# Set up the default cluster with one worker fewer than the number of
# cores, and load the packages the workers will need.
default_cluster(parallel::detectCores() - 1)
cluster_library(default_cluster(), "dplyr")
cluster_library(default_cluster(), "stringr")
# Wrap both tables as lazy data.tables so dtplyr can translate the verbs.
db1 <- db1 %>%
  data.table::data.table() %>%
  lazy_dt(immutable = FALSE)

db2 <- db2 %>%
  data.table::data.table() %>%
  lazy_dt(immutable = FALSE)
# Join, materialise the result, and spread it across the workers by id.
final_db_partition <- db1 %>%
  left_join(db2) %>%
  as.data.frame() %>%
  group_by(id) %>%
  partition(cluster = default_cluster())
final_db <- final_db_partition %>%
  as.data.table() %>%
  # lazy_dt(immutable = FALSE) %>%
  # Flag rows where m_destination contains m_origin as a whole word.
  mutate(m1 = ifelse(stringi::stri_detect_regex(m_destination,
                                                paste0("\\b", m_origin, "\\b")),
                     1, 0)) %>%
  as.data.frame() %>%
  # Group by every column except v1, v2 and v3, then sum the value columns.
  group_by(across(c(-v1, -v2, -v3))) %>%
  summarise(finalv1 = sum(finalv1, na.rm = TRUE),
            finalv2 = sum(finalv2, na.rm = TRUE)) %>%
  collect()
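
For reference, here is a minimal sketch of the same mutate/summarise logic kept entirely in data.table, with no partitioning and no conversions back and forth, in case the repeated party_df/data.table/data.frame round trips are what exhausts memory. The column names (m_origin, m_destination, v1-v3, finalv1, finalv2) come from the pipeline above; the final_db input name is hypothetical, and I am assuming the intent is to group by every column other than the three v columns and the two summed value columns.

library(data.table)
library(stringi)

# Assumed input: the joined table as a plain data.table with the columns
# used in the pipeline above.
final_dt <- as.data.table(final_db)

# Whole-word match of m_origin inside m_destination, added by reference
# so no extra copy of the 30-million-row table is made.
final_dt[, m1 := fifelse(
  stri_detect_regex(m_destination, paste0("\\b", m_origin, "\\b")), 1, 0)]

# Group by everything except v1/v2/v3 and the two summed columns.
grp_cols <- setdiff(names(final_dt),
                    c("v1", "v2", "v3", "finalv1", "finalv2"))
final_summary <- final_dt[, .(finalv1 = sum(finalv1, na.rm = TRUE),
                              finalv2 = sum(finalv2, na.rm = TRUE)),
                          by = grp_cols]

Would something like this be expected to fit in memory for ~30 million rows, or is the partition/collect round trip the more likely culprit in my pipeline above?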