
I am combining the dtplyr and multidplyr libraries to handle some basic mutate/summarise operations on a very large database; after merging, final_db_partition is sometimes 30 million rows long.

I cannot figure out whether I am doing something wrong, but either the R session aborts or I run out of memory.

R version 4.0.5 (2021-03-31) / Platform: x86_64-apple-darwin17.0 (64-bit) / Running under: macOS Big Sur 10.16

How should I tackle this issue?

library(multidplyr)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(data.table)
library(stringr)


default_cluster(parallel::detectCores()-1)
cluster_library(default_cluster(), 'dplyr')
cluster_library(default_cluster(), 'stringr')

db1 <- db1 %>% 
    data.table::data.table() %>% 
    lazy_dt(immutable = FALSE) 

db2 <- db2 %>% 
    data.table::data.table() %>% 
    lazy_dt(immutable = FALSE)  

final_db_partition <- db1 %>% 
    left_join(db2)  %>% 
    as.data.frame() %>% 
    group_by(id) %>% 
    partition(cluster = default_cluster()) 

final_db <- final_db_partition %>% 
    as.data.table() %>% 
    # lazy_dt(immutable = FALSE) %>% 
    mutate(m1 = ifelse(stringi::stri_detect_regex(m_destination,
                                                  paste0("\\b", m_origin, "\\b")), 1, 0)) %>% 
    as.data.frame() %>% 
    group_by(across(c(-v1, -v2, -v3))) %>% 
    summarise(finalv1 = sum(finalv1, na.rm = TRUE),
              finalv2 = sum(finalv2, na.rm = TRUE)) %>% 
    collect() 
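For reference, here is a minimal sketch of the same join / flag / grouped-sum pipeline done entirely in data.table, which avoids the intermediate `as.data.frame()` copies. The column names and toy values (`m_origin`, `m_destination`, `finalv1`, `finalv2`) are assumptions standing in for the real schemas, which are not shown above:

```r
library(data.table)
library(stringi)

# Toy stand-ins for db1/db2 (invented rows; real schemas not shown)
db1 <- data.table(id = c(1L, 1L, 2L),
                  m_origin = c("rome", "milan", "rome"),
                  finalv1 = c(10, 20, 30),
                  finalv2 = c(1, 2, 3))
db2 <- data.table(id = c(1L, 2L),
                  m_destination = c("rome paris", "berlin"))

# Left join with db1 on the left: db2[db1] keeps every row of db1
final <- db2[db1, on = "id"]

# Add the flag column by reference (:=), so no copy of the big table is made
final[, m1 := as.integer(stri_detect_regex(m_destination,
                                           paste0("\\b", m_origin, "\\b")))]

# Grouped sums inside data.table instead of dplyr's group_by/summarise
out <- final[, .(finalv1 = sum(finalv1, na.rm = TRUE),
                 finalv2 = sum(finalv2, na.rm = TRUE)),
             by = .(id, m1)]
```

Because `:=` modifies in place and the join/aggregation never leave data.table, peak memory stays close to one copy of the joined table rather than several.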


MCS
  • Are you just running out of RAM? – heds1 Jul 29 '21 at 08:43
  • R runs into a fatal error and forcefully terminates the script without providing any further information. At other times the reference to memory is explicit. – MCS Jul 29 '21 at 09:23
  • Right, sounds like you're running out of RAM if you're seeing that reference to memory. 30 million rows will require quite a lot of memory. – heds1 Jul 29 '21 at 09:49
  • Is there a way to allocate more RAM or by default all usable RAM is employed? – MCS Jul 29 '21 at 12:30
  • I am quite out of ideas. I am resorting to doing the left_merge within a for loop... – MCS Jul 29 '21 at 14:42
  • It looks like the grouping and summarization is happening outside of `data.table`/`dtplyr` (I think, I'm not a big `tidyverse` user); unless that's necessary for some reason, that could be a place that is using more memory than strictly needed. More generally for troubleshooting, it may be worth splitting up that last line to figure out exactly where you're running out of memory. Your title implies that you've isolated the mutate being the issue but then you mention an issue during `left_merge` in a comment? – ClancyStats Jul 30 '21 at 12:59
  • Using data.table directly should take up less memory and be faster. If you can provide an example of db1 and db2 with a few rows of (made up) data, I will attempt to provide an answer using data.table. – dnlbrky Dec 11 '21 at 18:40

0 Answers