Here's a familiar dataset for db1
library(dplyr)
db1 <- as_tibble(mtcars, rownames = "model")
From your previous post, I'm guessing that you might have something analogous to a tibble
of 'brands', with a corresponding regular expression for each brand that is used
to find the models associated with it.
db2 <-
    db1 |>
    mutate(
        ## brand is the first word of the model name; brand_re matches the
        ## brand (case-insensitively) followed by the rest of the model name
        brand = sub(" .*", "", model),
        brand_re = paste0("^(?i)", brand, "( .*)$")
    ) |>
    select(brand, brand_re) |>
    distinct()
and you'd like to do the equivalent of
fuzzyjoin::regex_right_join(db1, db2, by = c(model = "brand_re")) |>
    ## just a few columns to illustrate
    select(brand, model, mpg, cyl, disp, drat, wt)
The problem with using fuzzyjoin for your case is that it tries to be
efficient by making two parallel vectors that capture all combinations
of unique regular expressions and unique values to pass to the
stringi::stri_detect() function, but of course with 45k regular
expressions and 6 million values those are very long vectors (on the
order of 45,000 x 6,000,000 = 270 billion elements)!
The challenge with your approach to parallelization is that too much data
is used in the lapply(), and the operations (repeatedly subsetting a
very large data.frame and binding columns) are not efficient.
Instead, write a function that performs the necessary computation with
only the necessary data. Here we iterate over the patterns of
interest, and return a list of integer vectors indexing the elements of
str (model in our case) that match the corresponding regular expression.
fuzzy_match <- function(str, patterns) {
    ## for each pattern, the indexes of the elements of str matching it
    lapply(patterns, function(pattern, str) {
        which(stringi::stri_detect_regex(str, pattern))
    }, str)
}
joins <- fuzzy_match(db1$model, db2$brand_re)
str(head(joins))
For instance, the first element of joins says that the first regular
expression of db2$brand_re matches elements 1 and 2 of db1$model.
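One can confirm the correspondence directly (a quick sanity check using
the objects computed above):

## the first pattern, and the models it matches
db2$brand_re[1]
db1$model[joins[[1]]]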
We can then construct the joined table by building the relevant row
indexes and column-binding the corresponding subsets
db1_idx <- unlist(joins)
db2_idx <- rep(seq_along(joins), lengths(joins))
## FIXME: not quite a join, since regular expressions in db2 with no
## matches in db1 are not present
cbind(db2[db2_idx,], db1[db1_idx,]) |>
    as_tibble() |>
    select(brand, model, mpg, cyl, disp, drat, wt)
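To address the FIXME and recover right-join semantics, one option (a
sketch, using the objects computed above) is to append the unmatched
patterns with NA row indexes; subsetting a tibble with an NA index
yields a row of NAs for the db1 columns.

## patterns in db2 with no match in db1
unmatched <- which(lengths(joins) == 0L)
cbind(
    db2[c(db2_idx, unmatched), ],
    db1[c(db1_idx, rep(NA_integer_, length(unmatched))), ]
) |>
    as_tibble()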
When str is of length 5 million, this seems to take about 1 second
per regular expression, so 45k regular expressions would still take a
long time. The first step is to ask whether there are improvements to
the naive implementation of fuzzy_match() that would speed it up. For
instance, if str contains duplicates, it might be beneficial to only do
the match on unique values.
fuzzy_match_with_duplicates <- function(str, patterns) {
    ## only iterate on unique values; useful if many duplicates
    ustr <- unique(str)
    uresult <- lapply(patterns, function(pattern, str) {
        which(stringi::stri_detect_regex(str, pattern))
    }, ustr)
    ## map matches on unique values back to positions in the full str
    lapply(uresult, function(i) which(str %in% ustr[i]))
}
It might also pay to study the regular expressions and eliminate
redundancies. For instance, in the present case we could obtain relevant
results (except for models without brands) with a single pass

db1 |>
    mutate(brand = sub(" .*", "", model))

In the example of your previous post it seems like one could make
progress by replacing specific patterns such as SMITH, etc., with
[[:alpha:]]+ or similar.
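Another way to reduce redundancy (a sketch, assuming the brand strings
contain no regular expression metacharacters) is to collapse the
per-brand patterns into a single alternation, so the string vector is
scanned once and the matching brand is extracted as a capture group:

## one pattern matching any brand; column 2 of the result is the
## matched brand, or NA when no brand matches
big_re <- paste0("^(?i)(", paste(db2$brand, collapse = "|"), ") ")
brand_hit <- stringi::stri_match_first_regex(db1$model, big_re)[, 2]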
If the function is still slow, the goal is now to parallelize a simple
lapply()
with modest data. On Linux or macOS the easiest solution is
to use parallel::mclapply()
## save some cores for other things...
options(mc.cores = parallel::detectCores() - 2L)
fuzzy_match_parallel_mclapply <- function(str, patterns) {
    parallel::mclapply(patterns, function(pattern, str) {
        which(stringi::stri_detect_regex(str, pattern))
    }, str)
}
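and the call is the same as for the serial version

joins <- fuzzy_match_parallel_mclapply(db1$model, db2$brand_re)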
On Windows the easiest approach is to use parallel::parLapply()
with
a socket-based back end
fuzzy_match_parallel_sockets <- function(str, patterns) {
    parallel::parLapply(cl, patterns, function(pattern, str) {
        which(stringi::stri_detect_regex(str, pattern))
    }, str)
}
cl <- parallel::makePSOCKcluster(parallel::detectCores() - 2L)
joins <- fuzzy_match_parallel_sockets(db1$model, db2$brand_re)
parallel::stopCluster(cl)
Subsequent processing steps are unchanged.