1

I have understood that fuzzyjoin::difference will join two tables based on absolute difference between columns. Is there an R function that will join tables based on relative/percentage differences? I could do so using a full_join() + filter() but I suspect there is a more straightforward way.

Minimal example as follows:

library(tidyverse)
library(fuzzyjoin)
df_1 <- tibble(id = c("wombat", "jerry", "akow"), scores  = c(10,  50, 75))
df_2 <- tibble(id= c("wombat", "jerry", "akow"),  scores = c(14,  45, 82))

# joining based on absolute difference
difference_full_join(df_1, df_2, 
                     by=c("scores"), 
                     max_dist= 5, 
                     distance_col = "abs_diff" )

# A tibble: 4 x 5
  id.x   scores.x id.y   scores.y abs_diff
  <chr>     <dbl> <chr>     <dbl>    <dbl>
1 wombat       10 wombat       14        4
2 jerry        50 jerry        45        5
3 akow         75 NA           NA       NA
4 NA           NA akow         82       NA


## joining based on relative difference (setting 10% as a threshold)
full_join(df_1, df_2, "id") %>%
  dplyr::filter(  (abs(scores.x - scores.y)/scores.x)  <=0.10)

# A tibble: 2 x 3
  id    scores.x scores.y
  <chr>    <dbl>    <dbl>
1 jerry       50       45
2 akow        75       82
Yonghao
  • 166
  • 6
  • `df_1 <- tibble(id = Cs(wombat, jerry, akow), scores = c(10, 50, 75))` gives an error `Error in Cs(wombat, jerry, akow) : could not find function "Cs"` – Ronak Shah Aug 07 '21 at 04:58
  • @Ronak Shah The incredibly helpful Cs() function is from the Hmisc package https://rdrr.io/cran/Hmisc/man/Cs.html But I have edited my post to omit its use. Thanks! – Yonghao Aug 07 '21 at 08:55
  • 1
    Belatedly discovered the match_fun arguments... percent_detect <- function(x, y) (abs(x-y)/x) <= 0.10 fuzzy_left_join(df_1, df_2, by=c("id", "scores"), match_fun = list(`==`, percent_detect)) – Yonghao Aug 07 '21 at 09:21
  • creating your own match_fun looks like the right answer! go ahead and post it as a self-answer :) – Arthur Yip Aug 26 '21 at 19:40

0 Answers0