I'm trying to join two tables in dplyr. Sometimes it's possible to match exact on the column year, but in some cases the matching year is not available. In that case, i would like to join on the maximum year
Left <- tibble(id = c(1,2,3),
year = c(2010,2010,2012))
Right <- tibble(id = c(1,1,2,3,3),
year = c(2010,2011,2010,2010,2011),
new = c(T,T,T,T,T))
Joined <- left_join(Left, Right, by = c("id", "year"))
# A tibble: 3 x 3
id year new
<dbl> <dbl> <lgl>
1 1 2010 TRUE
2 2 2010 TRUE
3 3 2012 NA
As you can see, id 3 is not matched, I tried the package fuzzyjoin, but i'm unable to fuzzy-join on one column, and exact-join on another:
Fuzzy_joined <- fuzzyjoin::difference_left_join(Left, Right, by = c("id", "year"))
Fuzzy_joined
# A tibble: 9 x 5
id.x year.x id.y year.y new
<dbl> <dbl> <dbl> <dbl> <lgl>
1 1 2010 1 2010 TRUE
2 1 2010 1 2011 TRUE
3 1 2010 2 2010 TRUE
4 2 2010 1 2010 TRUE
5 2 2010 1 2011 TRUE
6 2 2010 2 2010 TRUE
7 2 2010 3 2010 TRUE
8 2 2010 3 2011 TRUE
9 3 2012 3 2011 TRUE
What is the most efficient way to join non-matching cases on the smallest distance of a the year variable, and an exact match of the id variable, using dplyr syntax?