When I want to join two data frames on interval conditions, I prefer the fuzzyjoin package because, in my opinion, the code is easy to read. But for large datasets fuzzyjoin is a no-go because it is very slow. Luckily, the data.table package can do the same join many times faster, and with some practice it also becomes easier to use. Here is a reproducible example with a benchmark:
library(fuzzyjoin)
library(data.table)
library(microbenchmark)
microbenchmark(
  "fuzzyjoin" = {
    fuzzy_left_join(df, df_dates,
                    by = c('group', 'start' = 'from', 'end' = 'to'),
                    match_fun = list(`==`, `<=`, `>=`))
  },
  "data.table" = {
    setDT(df)
    setDT(df_dates)
    # non-equi join; the x. prefix keeps df_dates' own interval columns,
    # otherwise `from`/`to` would echo df's start/end values
    df_dates[df, .(group, start, end, from = x.from, to = x.to, value),
             on = .(group, from >= start, to <= end)]
  }
)
#> Unit: milliseconds
#>        expr       min         lq       mean     median         uq       max neval cld
#>   fuzzyjoin 149.17845 155.207436 159.053306 157.323145 159.930656 258.57719   100   b
#>  data.table   1.44272   1.618015   1.999427   1.704792   1.934678  20.37736   100   a
Created on 2022-12-15 with reprex v2.0.2
As you can see, data.table is already roughly 100x faster for this simple computation. So I was wondering: why is the fuzzyjoin package so much slower than the data.table option? And is the data.table option always faster than fuzzyjoin?
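To double-check that the two calls really return the same rows, here is a quick sanity check that can be run with the data at the bottom of this post (the subsetting/renaming only lines up fuzzyjoin's group.x/group.y suffixes with the data.table output):

res_fuzzy <- fuzzy_left_join(df, df_dates,
                             by = c('group', 'start' = 'from', 'end' = 'to'),
                             match_fun = list(`==`, `<=`, `>=`))
res_dt <- df_dates[df, .(group, start, end, from = x.from, to = x.to, value),
                   on = .(group, from >= start, to <= end)]

# keep the comparable columns and rename group.x (the df side) back to group
res_fuzzy <- res_fuzzy[, c("group.x", "start", "end", "from", "to", "value")]
names(res_fuzzy)[1] <- "group"

all.equal(as.data.frame(res_dt), res_fuzzy, check.attributes = FALSE)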
Data:

df <- structure(list(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  start = structure(c(19327, 19330, 19333, 19336, 19339, 19342,
                      19327, 19330, 19333, 19336, 19339, 19342), class = "Date"),
  end = structure(c(19330, 19333, 19336, 19339, 19342, 19345,
                    19330, 19333, 19336, 19339, 19342, 19345), class = "Date")),
  class = "data.frame", row.names = c(NA, -12L))

df_dates <- structure(list(
  group = c("A", "A", "B", "B"),
  from = structure(c(19328, 19340, 19332, 19339), class = "Date"),
  to = structure(c(19329, 19341, 19333, 19340), class = "Date"),
  value = c(1, 3, 2, 4)),
  class = "data.frame", row.names = c(NA, -4L))
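For what it's worth, data.table also has a dedicated overlap-join function, foverlaps(). A minimal sketch of the same containment condition (note that, unlike the left join above, it keeps df_dates' rows rather than df's, so it is not a drop-in replacement):

setDT(df); setDT(df_dates)
setkey(df, group, start, end)  # foverlaps() requires the second table to be keyed

# type = "within": each df_dates interval must lie inside a df interval
foverlaps(df_dates, df, by.x = c("group", "from", "to"), type = "within")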