Dplyr join on maximum matching value, if no exact match is possible

Question

I'm trying to join two tables in dplyr. Sometimes it's possible to match exact on the column year, but in some cases the matching year is not available. In that case, i would like to join on the maximum year

Left <- tibble(id = c(1,2,3),
           year = c(2010,2010,2012))

Right <- tibble(id = c(1,1,2,3,3),
            year = c(2010,2011,2010,2010,2011),
            new = c(T,T,T,T,T))

Joined <- left_join(Left, Right, by = c("id", "year"))

# A tibble: 3 x 3
     id  year   new
  <dbl> <dbl> <lgl>
1     1  2010  TRUE
2     2  2010  TRUE
3     3  2012    NA

As you can see, id 3 is not matched, I tried the package fuzzyjoin, but i'm unable to fuzzy-join on one column, and exact-join on another:

Fuzzy_joined <- fuzzyjoin::difference_left_join(Left, Right, by = c("id", "year"))
Fuzzy_joined
# A tibble: 9 x 5
   id.x year.x  id.y year.y   new
  <dbl>  <dbl> <dbl>  <dbl> <lgl>
1     1   2010     1   2010  TRUE
2     1   2010     1   2011  TRUE
3     1   2010     2   2010  TRUE
4     2   2010     1   2010  TRUE
5     2   2010     1   2011  TRUE
6     2   2010     2   2010  TRUE
7     2   2010     3   2010  TRUE
8     2   2010     3   2011  TRUE
9     3   2012     3   2011  TRUE

What is the most efficient way to join non-matching cases on the smallest distance of a the year variable, and an exact match of the id variable, using dplyr syntax?

Can you please elaborate on *In that case, i would like to join on the maximum year*? — Sotos, Jan 10 '18 at 10:46
@Sotos, In the case of id == 3, no match is made, so I want to match the Left row to the row on the Right table with the highest value on year. I edited the example in the question so there are two values with id ==3 in the Right table. I want to join Left id == 3 to the Right row id == 3 & year == 2011 — Jelger van Zaane, Jan 10 '18 at 12:08
@RichardTelford, a full_join gives me 6 rows, what is the filter operation I should do? — Jelger van Zaane, Jan 10 '18 at 12:11

score 1 · Accepted Answer · answered Jan 10 '18 at 13:04

I would use a left join on id and year and then filter to get the best match for year

left_join(Left, Right, by = "id", suffix  = c("", "_r")) %>% 
  mutate(delta = year - year_r) %>% 
  filter(delta >= 0) %>% 
  group_by(id, year) %>% 
  slice(which.min(delta)) %>% 
  select(-delta)

# A tibble: 3 x 4
# Groups:   id, year [3]
     id  year year_r   new
  <dbl> <dbl>  <dbl> <lgl>
1     1  2010   2010  TRUE
2     2  2010   2010  TRUE
3     3  2012   2011  TRUE

There might be more efficient solutions, but this will work well with moderate-sized data sets.

Dplyr join on maximum matching value, if no exact match is possible

1 Answers1