I am trying to perform some matching that rests on the implementation of a one-to-many left_join
.
The issue is that -even running the whole thing using cluster computing - the basic match produce a dataset too big to handle.
I get this error:
#Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
#Join results in more than 2^31 rows (internal vecseq reached physical limit).
#Very likely misspecified join.
#Check for duplicate key values in i each of which join to the same group in x over and over again.
#If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.
#Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
#Calls: polbur_match_nb ... dt_eval -> eval_tidy -> [ -> [.data.table -> vecseq
The join is not per se misspecified but I need to find a way around doing it all at once. I was trying to divide the left db into more reasonable chunks (by state) and run my_matching_function
for each one.
However, this is extremely slow even with a micro sample database for both p and b. What should I do to speed up the whole thing? Is there anything I can improve in the code
Apologies for the lack of reproducible example
match_bystate <- function(r,p,b){
gc()
states <- sort(unique(p$state))
matches_list = list()
for(i in 1:length(states)){
p_state <- p %>%
filter(state==states[i])
matches_final <- my_matching_function(r,p,b)
matches_list[[i]] <- matches_final
}
final = do.call(rbind, matches_list)
saveRDS(final,file=paste0("../",gsub("-","",str_sub(Sys.time(),1,10)),"_match_",r,".RDS"))
}
This is a simplified version of my my_matching_function
:
my_matching_function <- function(r,p,b){
p_original <- p
p <- p %>%
dplyr::select(id,city,lastname1,lastname2)
b <- b %>%
dplyr::select(city,lastname1,lastname2) %>%
dplyr::rename("lastname1_match"="lastname1",
"lastname2_match"="lastname2")
matches <- p %>%
data.table::data.table() %>%
lazy_dt(immutable = FALSE) %>%
dplyr::left_join(b, by = "city") %>%
dplyr::mutate(match_1=tidyr::replace_na(ifelse(lastname1==lastname1_match|lastname1==lastname2_match,1,0),0)) %>%
dplyr::mutate(match_2=tidyr::replace_na(ifelse(lastname2==lastname1_match|lastname2==lastname2_match,1,0),0)) %>%
as.data.frame() %>%
dplyr::mutate(sum = rowSums(across(match_1:match_2))) %>%
data.table::data.table() %>%
dplyr::mutate(final_1 = ifelse(sum>=1,1,0)) %>%
dplyr::mutate(final_2 = ifelse(sum>=2,1,0)) %>%
group_by_at(c(names(p))) %>%
dplyr::summarise(final_1 = sum(final_1),
final_2 = sum(final_2)) %>%
as.data.frame()
Sys.sleep(60)
matches_final <- p_original %>%
dplyr::left_join(matches) %>%
dplyr::mutate(raisyear=r)
return(matches_final)
}