0

I'm trying to join two tables in dplyr. Sometimes it's possible to match exact on the column year, but in some cases the matching year is not available. In that case, i would like to join on the maximum year

Left <- tibble(id = c(1,2,3),
           year = c(2010,2010,2012))

Right <- tibble(id = c(1,1,2,3,3),
            year = c(2010,2011,2010,2010,2011),
            new = c(T,T,T,T,T))

Joined <- left_join(Left, Right, by = c("id", "year"))

# A tibble: 3 x 3
     id  year   new
  <dbl> <dbl> <lgl>
1     1  2010  TRUE
2     2  2010  TRUE
3     3  2012    NA

As you can see, id 3 is not matched, I tried the package fuzzyjoin, but i'm unable to fuzzy-join on one column, and exact-join on another:

Fuzzy_joined <- fuzzyjoin::difference_left_join(Left, Right, by = c("id", "year"))
Fuzzy_joined
# A tibble: 9 x 5
   id.x year.x  id.y year.y   new
  <dbl>  <dbl> <dbl>  <dbl> <lgl>
1     1   2010     1   2010  TRUE
2     1   2010     1   2011  TRUE
3     1   2010     2   2010  TRUE
4     2   2010     1   2010  TRUE
5     2   2010     1   2011  TRUE
6     2   2010     2   2010  TRUE
7     2   2010     3   2010  TRUE
8     2   2010     3   2011  TRUE
9     3   2012     3   2011  TRUE

What is the most efficient way to join non-matching cases on the smallest distance of a the year variable, and an exact match of the id variable, using dplyr syntax?

  • Can you please elaborate on *In that case, i would like to join on the maximum year*? – Sotos Jan 10 '18 at 10:46
  • @Sotos, In the case of id == 3, no match is made, so I want to match the Left row to the row on the Right table with the highest value on year. I edited the example in the question so there are two values with id ==3 in the Right table. I want to join Left id == 3 to the Right row id == 3 & year == 2011 – Jelger van Zaane Jan 10 '18 at 12:08
  • @RichardTelford, a full_join gives me 6 rows, what is the filter operation I should do? – Jelger van Zaane Jan 10 '18 at 12:11

1 Answers1

1

I would use a left join on id and year and then filter to get the best match for year

left_join(Left, Right, by = "id", suffix  = c("", "_r")) %>% 
  mutate(delta = year - year_r) %>% 
  filter(delta >= 0) %>% 
  group_by(id, year) %>% 
  slice(which.min(delta)) %>% 
  select(-delta)

# A tibble: 3 x 4
# Groups:   id, year [3]
     id  year year_r   new
  <dbl> <dbl>  <dbl> <lgl>
1     1  2010   2010  TRUE
2     2  2010   2010  TRUE
3     3  2012   2011  TRUE

There might be more efficient solutions, but this will work well with moderate-sized data sets.

Richard Telford
  • 9,558
  • 6
  • 38
  • 51