
I am trying to perform some matching that rests on a one-to-many left_join. The issue is that, even running the whole thing on a computing cluster, the basic match produces a dataset too big to handle.

I get this error:

#Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
#Join results in more than 2^31 rows (internal vecseq reached physical limit). 
#Very likely misspecified join. 
#Check for duplicate key values in i each of which join to the same group in x over and over again. 
#If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. 
#Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

#Calls: polbur_match_nb ... dt_eval -> eval_tidy -> [ -> [.data.table -> vecseq
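
From the error message and the data.table docs, my understanding is that the suggested by = .EACHI route would aggregate during the join instead of materialising it first, roughly like this (an untested sketch: p_dt and b_dt mirror the inputs of my function further down, and the counting logic is only my approximation of what that function does):

library(data.table)

p_dt <- as.data.table(p)[, .(id, city, lastname1, lastname2)]
b_dt <- as.data.table(b)[, .(city,
                             lastname1_match = lastname1,
                             lastname2_match = lastname2)]

# For each row of p_dt (by = .EACHI), join its city against b_dt and
# aggregate immediately, so the one-to-many result is never held at once
res <- b_dt[p_dt, on = "city", {
  m1 <- i.lastname1 == lastname1_match | i.lastname1 == lastname2_match
  m2 <- i.lastname2 == lastname1_match | i.lastname2 == lastname2_match
  m1[is.na(m1)] <- FALSE   # rows of p with no b row in the same city
  m2[is.na(m2)] <- FALSE
  .(id      = i.id,
    final_1 = sum(m1 | m2),   # b rows where at least one name matches
    final_2 = sum(m1 & m2))   # b rows where both names match
}, by = .EACHI]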

The join is not per se misspecified, but I need to find a way around doing it all at once. I was trying to divide the left-hand table into more reasonable chunks (by state) and run my_matching_function on each one.

However, this is extremely slow, even with micro sample databases for both p and b. What should I do to speed the whole thing up? Is there anything I can improve in the code?

Apologies for the lack of a reproducible example.

match_bystate <- function(r, p, b){
  
  gc()
  states <- sort(unique(p$state))
  
  matches_list <- list()
  for (i in seq_along(states)) {
    # restrict p to the current state so each call works on a smaller chunk
    p_state <- p %>% 
      filter(state == states[i])
    # note: pass p_state, not p, otherwise every iteration reprocesses the full dataset
    matches_final <- my_matching_function(r, p_state, b)
    matches_list[[i]] <- matches_final
  }
  
  final <- do.call(rbind, matches_list)
  # file name: YYYYMMDD_match_<r>.RDS (str_sub() is from stringr)
  saveRDS(final, file = paste0("../", gsub("-", "", str_sub(Sys.time(), 1, 10)), "_match_", r, ".RDS"))
  
}
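
One variant I am considering, to at least bound memory, is saving each state's result to disk as soon as it is produced instead of accumulating everything in matches_list (untested sketch; out_dir and the file naming are placeholders):

match_bystate_chunked <- function(r, p, b, out_dir = "../chunks"){
  # hypothetical variant: write each state's matches immediately so that at
  # most one state's result is held in memory at a time
  dir.create(out_dir, showWarnings = FALSE)
  for (st in sort(unique(p$state))) {
    p_state <- dplyr::filter(p, state == st)
    saveRDS(my_matching_function(r, p_state, b),
            file = file.path(out_dir, paste0("match_", r, "_", st, ".RDS")))
    rm(p_state)
    gc()
  }
}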

This is a simplified version of my_matching_function:

my_matching_function <- function(r, p, b){

  p_original <- p
  
  # keep only the columns needed for the match
  p <- p %>% 
    dplyr::select(id, city, lastname1, lastname2) 

  b <- b %>% 
    dplyr::select(city, lastname1, lastname2) %>% 
    dplyr::rename("lastname1_match" = "lastname1",
                  "lastname2_match" = "lastname2") 

  matches <- p %>% 
    data.table::data.table() %>%  
    dtplyr::lazy_dt(immutable = FALSE) %>% 
    # one-to-many join: every p row is paired with every b row in the same city
    dplyr::left_join(b, by = "city") %>% 
    # flag whether each of p's last names matches either of b's last names
    dplyr::mutate(match_1 = tidyr::replace_na(ifelse(lastname1 == lastname1_match | lastname1 == lastname2_match, 1, 0), 0)) %>% 
    dplyr::mutate(match_2 = tidyr::replace_na(ifelse(lastname2 == lastname1_match | lastname2 == lastname2_match, 1, 0), 0)) %>%
    as.data.frame() %>% 
    dplyr::mutate(sum = rowSums(across(match_1:match_2))) %>%
    data.table::data.table() %>%  
    dplyr::mutate(final_1 = ifelse(sum >= 1, 1, 0)) %>%   # at least one name matched
    dplyr::mutate(final_2 = ifelse(sum >= 2, 1, 0)) %>%   # both names matched
    # collapse back to one row per p record, counting matching b rows
    dplyr::group_by_at(c(names(p))) %>%
    dplyr::summarise(final_1 = sum(final_1),
                     final_2 = sum(final_2)) %>% 
    as.data.frame()
  Sys.sleep(60)  # fixed 60-second pause after every call
  
  matches_final <- p_original %>% 
    dplyr::left_join(matches) %>% 
    dplyr::mutate(raisyear = r)
  
  return(matches_final) 
}
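
As far as I can tell, the repeated as.data.frame()/data.table() conversions in the middle force the lazy pipeline to materialise several times. One thing I am considering is keeping the whole chain lazy and collecting once at the end, along these lines (untested sketch; it would not by itself lift the 2^31 row cap on the join, it only removes the intermediate copies):

library(dplyr)
library(dtplyr)
library(data.table)

matches <- p %>%
  as.data.table() %>%
  lazy_dt(immutable = FALSE) %>%
  left_join(b, by = "city") %>%
  # coalesce(..., 0L) plays the role of replace_na() inside the lazy chain
  mutate(match_1 = coalesce(as.integer(lastname1 == lastname1_match |
                                         lastname1 == lastname2_match), 0L),
         match_2 = coalesce(as.integer(lastname2 == lastname1_match |
                                         lastname2 == lastname2_match), 0L)) %>%
  mutate(final_1 = as.integer(match_1 + match_2 >= 1),     # at least one name matched
         final_2 = as.integer(match_1 + match_2 >= 2)) %>% # both names matched
  group_by(id, city, lastname1, lastname2) %>%
  summarise(final_1 = sum(final_1),
            final_2 = sum(final_2)) %>%
  as.data.frame()   # single collection point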



  • That's a lot of `dplyr::`s, why not just import `dplyr` with `library(dplyr)`? – U13-Forward Oct 29 '21 at 10:44
  • That's just to avoid function conflicts, for example with tidyr – MCS Oct 29 '21 at 11:15
  • *That's just to avoid function conflicts, for example with tidyr* What do you mean by that? What functions from tidyr do you think would be masked by dplyr? – dario Oct 29 '21 at 11:41
  • Is this a non-problem? Meaning, would specifying the namespace slow down my code or not? In any case, the masking was with respect to multidplyr, dtplyr and data.table, not tidyr. I just specify it each time to keep the writing consistent. – MCS Oct 29 '21 at 11:51

0 Answers