
When I want to join two data frames based on interval conditions, I prefer to use the fuzzyjoin package because, in my opinion, it is easy to read. But when I need to work with large datasets, fuzzyjoin is a no-go because it is very slow. Luckily, the data.table package can do the same join many times faster, and with some practice it also becomes easier to use. Here is a reproducible example with a benchmark (`df` and `df_dates` are defined at the bottom of the question):

library(fuzzyjoin)
library(data.table)
library(microbenchmark)

microbenchmark(
  "fuzzyjoin" = {
    fuzzy_left_join(df, df_dates, by = c('group', 'start' = 'from', 'end' = 'to'),
                    match_fun = list(`==`, `<=`, `>=`))
  }, 
  "data.table" = {
    setDT(df)
    setDT(df_dates)
    df_dates[df, .(group, start, end, from, to, value), 
             on = .(group, from >= start, to <= end)]
  }
)
#> Unit: milliseconds
#>        expr       min         lq       mean     median         uq       max
#>   fuzzyjoin 149.17845 155.207436 159.053306 157.323145 159.930656 258.57719
#>  data.table   1.44272   1.618015   1.999427   1.704792   1.934678  20.37736
#>  neval cld
#>    100   b
#>    100  a

Created on 2022-12-15 with reprex v2.0.2

As you can see, even for this simple computation data.table is already more than 100x faster. So I was wondering: why is the fuzzyjoin package so much slower than the data.table option? And is the data.table option always faster than fuzzyjoin?


Data:

df <- structure(list(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  start = structure(c(19327, 19330, 19333, 19336, 19339, 19342,
                      19327, 19330, 19333, 19336, 19339, 19342), class = "Date"),
  end = structure(c(19330, 19333, 19336, 19339, 19342, 19345,
                    19330, 19333, 19336, 19339, 19342, 19345), class = "Date")
), class = "data.frame", row.names = c(NA, -12L))

df_dates <- structure(list(
  group = c("A", "A", "B", "B"),
  from = structure(c(19328, 19340, 19332, 19339), class = "Date"),
  to = structure(c(19329, 19341, 19333, 19340), class = "Date"),
  value = c(1, 3, 2, 4)
), class = "data.frame", row.names = c(NA, -4L))
Quinten

  • This is an interesting question. I just accepted that data.table is by far the fastest within R, and also across other programming languages. See here: "There are a number of reasons why data.table is fast, but a key one is that unlike many other tools, it allows you to modify things in your table by reference, so it is changed in-situ rather than requiring the object to be recreated with your modifications. ..." (A minimal sketch of what modifying by reference means follows these comments.) – TarJae Dec 15 '22 at 19:26
  • One reason might be that [fuzzyjoin relies on other R packages](https://github.com/dgrtwo/fuzzyjoin/blob/master/R/fuzzy_join.R) like `dplyr`, `purrr` and `tidyr`, whereas `data.table` is optimized C code. – Waldi Dec 15 '22 at 19:52
  • You might take a look at the development version of dplyr, which allows non-equi joins like fuzzyjoin and data.table; based on other recent updates to dplyr, I expect it will be more performant than fuzzyjoin. https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/ – Jon Spring Dec 15 '22 at 20:35
  • Thank you @TarJae, yes, data.table is incredible. Great link you have shared! – Quinten Dec 15 '22 at 20:47
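
As TarJae's comment above mentions, a key reason for data.table's speed is that `:=` modifies a table by reference rather than copying it. A minimal sketch of the difference, using `data.table::address()` and base R's `tracemem()` to observe copies (`tracemem()` requires an R build with memory profiling, which the standard binaries have):

library(data.table)

dt <- data.table(x = 1:5)
address(dt)          # memory address before the update

# `:=` adds a column in place; the table is not copied
dt[, y := x * 2L]
address(dt)          # same address: dt was modified in-situ

# a base-R data.frame, by contrast, is copied on modification
df2 <- data.frame(x = 1:5)
tracemem(df2)        # prints a message whenever df2 is duplicated
df2$y <- df2$x * 2L  # triggers a copy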

1 Answer


It looks like the development version of dplyr (1.0.99.9000) is, in this case on my computer, faster than data.table, and it offers improved syntax over fuzzyjoin with `join_by`.

# added to the microbenchmark code in the OP; requires the dplyr dev version
"dplyr 1.0.99.9000" = {
    left_join(df, df_dates, join_by(group, start <= from, end >= to))
  }


Unit: milliseconds
              expr      min        lq       mean    median        uq      max neval
         fuzzyjoin 131.2089 141.23245 155.888496 147.76050 160.55830 372.0843   100
 dplyr 1.0.99.9000   1.7140   1.84115   2.189630   1.94575   2.09340  13.0027   100
 data.table 1.14.6   2.0335   2.23480   2.868417   2.32665   2.58865  16.7603   100
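
For reference, the fragment above can also be run standalone; per the blog post linked in the comments, `join_by()` is slated for the dplyr 1.1.0 release, so this sketch assumes dplyr >= 1.1.0 (or the dev version used here):

library(dplyr)  # join_by() needs dplyr 1.1.0+ (dev version at the time of writing)

left_join(df, df_dates, join_by(group, start <= from, end >= to))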
Jon Spring
  • (I am getting the same output as your `fuzzyjoin` approach, but I see the data.table output is a little different in that it includes `from`/`to` values (copied from `start`/`end`) where there are no matches. Is that ok? A sketch of how to keep the original `from`/`to` columns follows these comments.) – Jon Spring Dec 15 '22 at 21:07
  • Thank you for your answer! That is interesting and totally fine! – Quinten Dec 15 '22 at 21:28
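
Regarding Jon Spring's observation above: in a data.table non-equi join, the join columns in the result take their values from the outer table `i`, so `from`/`to` come out as copies of `start`/`end`. If the original interval bounds from `df_dates` are wanted instead, they can be pulled through with the `x.` prefix; a minimal sketch (unmatched rows then get NA rather than copies of `start`/`end`):

library(data.table)
setDT(df)
setDT(df_dates)

# same join as in the question, but keeping df_dates' own from/to columns
df_dates[df,
         .(group, start, end, from = x.from, to = x.to, value),
         on = .(group, from >= start, to <= end)]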