"fuzzy" inner_join in dplyr to keep both rows that do AND not exactly match

Question

I am working with two datasets that I would like to join based not exact matches between them, but rather approximate matches. My question is similar to this OP.

Here are examples of what my two dataframes look like.

df1 is this one:

df2 is this one:

I am currently using inner_join(df1, df2, by=c("x") to join the two together.

This gives me:

However, what I really want to do is join the two dfs based on these conditions:

any exact matches are joined first (exactly like how inner_join() currently works)
BUT, if there are no exact matches, then join to any match ± 0.5

The kind of output I am trying to get would look like this:

 x     y
 4.8   6.6
 12    1
 4     1    #the y value is from x=4.5 in df1
 4     0.5  #the y value is from x=3.5 in df1
 3.5   0.5
 12.5  1    #the y value is from x=12 in df1
 12.5  1.8  #the y value is from x=13 in df1

I typically work in dplyr, so a dplyr solution would be appreciated. But, I am also open to other suggestions because I don't know if dplyr will be flexible enough to do a "fuzzy" join.

(I am aware of the fuzzyjoin package, but it doesn't seem to get at exactly what I am trying to do here)

PaulS · Accepted Answer · 2022-04-01T20:03:45.670

3

A possible solution, with no join:

library(tidyverse)

df1 %>%
  rename(x1 = x) %>%
  crossing(df2) %>%
  mutate(diff = abs(x1-x)) %>% 
  filter(diff <= 0.5) %>% 
  group_by(x1) %>% 
  mutate(aux = any(diff == 0)) %>% 
  filter(aux*(diff == 0) | !aux) %>% 
  select(-diff, -aux) %>% 
  ungroup

#> # A tibble: 7 × 3
#>      x1     x     y
#>   <dbl> <dbl> <dbl>
#> 1   3.5   3.5   0.5
#> 2   4     3.5   0.5
#> 3   4     4.5   1  
#> 4   4.8   4.8   6.6
#> 5  12    12     1  
#> 6  12.5  12     1  
#> 7  12.5  13     1.8

edited Apr 01 '22 at 20:03

answered Apr 01 '22 at 16:44

PaulS

21,159
2
9
26

1

This does it nearly perfectly! One part is missing: how would you recommend dealing with perfect matches? I figure I could add an `if` `else` sort of statement and further `filter()` for the perfect matches, but I am wondering if there is a more straightforward way. – Blundering Ecologist Apr 01 '22 at 18:23
1

You are welcome! I do not quite understand you question, as `abs(x1 - x) <= 0.5` also includes the cases of perfect matches, as `x1 == x` satisfies the condition `abs(x1 - x) <= 0.5`. Or am I missing something? – PaulS Apr 01 '22 at 18:30
1

So, if we look at `x1==4.8` it is there twice because it is close to `x==4.5` and `x==4.8`, but is it possible for those cases (exact matches) to only be in the `df` once, despite that they might have non-exact matches as well? – Blundering Ecologist Apr 01 '22 at 19:10
1

Got your point. Please, see my edited answer. – PaulS Apr 01 '22 at 20:05

score 2 · Answer 2 · answered Apr 14 '22 at 13:08

2

You could use {powerjoin}

library(powerjoin)
power_left_join(
  df1, df2,
  by = ~ .x$x == .y$x | ! .x$x %in% .y$x & .x$x <= .y$x +.5 & .x$x >= .y$x -.5, 
  keep = "left")
#>      x   y
#> 1  4.8 6.6
#> 2 12.0 1.0
#> 3  4.0 1.0
#> 4  4.0 0.5
#> 5  3.5 0.5
#> 6 12.5 1.0
#> 7 12.5 1.8

^{Created on 2022-04-14 by the reprex package (v2.0.1)}

answered Apr 14 '22 at 13:08

moodymudskipper

46,417
11
121
167

1

I really like your solution, @moodymudskipper (+1)! And, Antoine, thanks for having co-created `powerjoin`! – PaulS Apr 14 '22 at 18:57

"fuzzy" inner_join in dplyr to keep both rows that do AND not exactly match

2 Answers2