
I'm trying to parallelize a function that works roughly like this:

process_row <- function(i, db1, db2){
    row <- db1[,i]
    match_db <- db2 %>% dplyr::filter(name_clear == row$name)
    out <- cbind(row, match_db)
}

out_list <- lapply(1:length(db1), process_row, db1=db1, db2=db2)

In reality it's more complicated but the basic idea is that it goes row-by-row through db1, finds appropriate matching rows in db2 according to a set of rules, and then binds them together column-wise (one-to-many relationship). In terms of size, db1 has 40k rows, db2 has 6 million. The matches (out) are usually between 1 and 100 rows long.

When I run the regular lapply, it takes up very little CPU & memory. I can "manually" parallelize it by opening up multiple instances of RStudio and changing the range of the lapply to 1:10000 in the first one, 10001:20000 in the second, etc., but that's ridiculous. Unfortunately it's the only "working" solution I found so far. Neither foreach nor future.apply work - they just crash after running out of memory completely, even on a range of 1:10.

I suspect this has to do with the way I wrote the function. How should I change it?

Spine Feast
  • This sounds like a `dplyr::left_join(db1, db2, by = c(name = "name_clear"))` or similar; it would help if you presented a simple reproducible example... – Martin Morgan Feb 16 '22 at 19:46
  • @MartinMorgan the process is more complicated, the matching is done on regexes so won't work with regular joins – Spine Feast Feb 16 '22 at 19:50
  • Please add a simple reproducible example (e.g., db1 with 2 rows, db2 with 3 rows?), including desired output; the role of regular expressions is not at all obvious in your question, and as written you iterate over columns, not rows... – Martin Morgan Feb 16 '22 at 19:55
  • Maybe the fuzzyjoin package??? – Dave2e Feb 16 '22 at 23:08
  • @Dave2e doesn't work with big tables, "failed to allocate a vector of size 600GB" - lol – Spine Feast Feb 17 '22 at 00:29

1 Answer


Here's a familiar dataset for db1

library(dplyr)
db1 = as_tibble(mtcars, rownames = "model")

From your previous post, I'm guessing that you might have something analogous to a tibble of 'brands', each with a corresponding regular expression that matches the models belonging to that brand.

db2 <-
    db1 |>
    mutate(
        brand = sub(" .*", "", model),
        brand_re = paste0("^(?i)", brand, "( .*)$")
    ) |>
    select(brand, brand_re) |>
    distinct()

and you'd like to do the equivalent of

fuzzyjoin::regex_right_join(db1, db2, by = c(model = "brand_re")) |>
    ## just a few columns to illustrate
    select(brand, model, mpg, cyl, disp, drat, wt)

The problem with using fuzzyjoin for your case is that it tries to be efficient by making two parallel vectors that capture all combinations of unique regular expressions and unique values to pass to the stringi::stri_detect() function, but of course with 45k regular expressions and 6 million values, those are very long vectors!
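
A quick back-of-envelope count (assuming roughly 45k distinct patterns and 6 million distinct values, as above) shows why this blows up:

## rough count of pattern/value pairs fuzzyjoin would need to evaluate
n_patterns <- 45e3
n_values <- 6e6
n_patterns * n_values  # 2.7e11 combinations -- far too many to materialize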

The challenge with your approach to parallelization is that too much data is used in the lapply(), and the operations (repeatedly subsetting a very large data.frame and binding columns) are not efficient.

Instead, write a function that performs the necessary computation with only the necessary data. Here we iterate over the patterns of interest and return a list of integer vectors indexing the elements of str (model, in our case) that match the corresponding regular expression.

fuzzy_match <- function(str, patterns) {
    ## for each pattern, the integer indices of str that match it
    lapply(patterns, function(pattern, str) {
        which(stringi::stri_detect_regex(str, pattern))
    }, str)
}
joins <- fuzzy_match(db1$model, db2$brand_re)
str(head(joins))

For instance, the first element of joins says that the first regular expression of db2$brand_re matches elements 1 and 2 of db1$model.
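
With the mtcars-based toy data above, str(head(joins)) should produce something along these lines (the empty element is Valiant, whose model name has no text after the brand for the pattern to match):

List of 6
 $ : int [1:2] 1 2
 $ : int 3
 $ : int [1:2] 4 5
 $ : int(0)
 $ : int 7
 $ : int [1:7] 8 9 10 11 12 13 14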

We can then build the joined table by constructing the relevant row indexes, expanding both tables, and binding them column-wise.

db1_idx <- unlist(joins)
db2_idx <- rep(seq_along(joins), lengths(joins))
## FIXME: not quite a join, since regular expressions in db2 with no
## matches in db1 are not present
cbind(db2[db2_idx,], db1[db1_idx,]) |>
    as_tibble() |>
    select(brand, model, mpg, cyl, disp, drat, wt)

When str is of length 5 million, this seems to take about 1 second per regular expression, so 45k regular expressions is still a long time. The first step is to ask whether there are improvements to the naive implementation of fuzzy_match() that would speed it up. For instance, if str contains duplicates, it might be beneficial to do the match only on unique values.

fuzzy_match_with_duplicates <- function(str, patterns) {
    ## only iterate on unique values; useful if many duplicates
    ustr <- unique(str)
    uresult <- lapply(patterns, function(pattern, str) {
        which(stringi::stri_detect_regex(str, pattern))
    }, ustr)
    ## map matches on unique values back to positions in the full str
    lapply(uresult, function(i) which(str %in% ustr[i]))
}
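
On the small example the two implementations should agree, which is worth checking before running on the real data:

identical(
    fuzzy_match(db1$model, db2$brand_re),
    fuzzy_match_with_duplicates(db1$model, db2$brand_re)
)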

It might also pay to study the regular expressions and eliminate redundancies; for instance, in the present case we could obtain relevant results (except for models without brands) with a single pass

db1 |>
    mutate(brand = sub(" .*", "", model))
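
If the brand can be extracted cleanly like this, the regex matching can be replaced by an ordinary equality join (a sketch; it assumes db2 keeps its brand column, as in the example above):

db1 |>
    mutate(brand = sub(" .*", "", model)) |>
    left_join(db2, by = "brand")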

In the example of your previous post, it seems like one could make progress by replacing specific patterns such as SMITH with [[:alpha:]]+ or similar.

If the function is still slow, the goal is now to parallelize a simple lapply() with modest data. On Linux or macOS the easiest solution is to use parallel::mclapply()

## save some cores for other things...
options(mc.cores = parallel::detectCores() - 2L)
fuzzy_match_parallel_mclapply <- function(str, patterns) {
    parallel::mclapply(patterns, function(pattern, str) {
        which(stringi::stri_detect_regex(str, pattern))
    }, str)
}
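
Usage then mirrors the serial version:

joins <- fuzzy_match_parallel_mclapply(db1$model, db2$brand_re)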

On Windows the easiest approach is to use parallel::parLapply() with a socket-based back end

library(parallel)

fuzzy_match_parallel_sockets <- function(cl, str, patterns) {
    parLapply(cl, patterns, function(pattern, str) {
        which(stringi::stri_detect_regex(str, pattern))
    }, str)
}

cl <- makePSOCKcluster(detectCores() - 2L)
joins <- fuzzy_match_parallel_sockets(cl, db1$model, db2$brand_re)
stopCluster(cl)

Subsequent processing steps are unchanged.

Martin Morgan