I want to join two data frames one with a year range and the other with a year

Question

I have two data frames that look like this

data1<-data.frame(name=c("Jessica_Smith","John_Smith","John_Smith") ,
max_year=c(2000,1989,2005), 
min_year=c(1990,1989,2001))

data2<-data.frame(name=c("Jessica_Smith","John_Smith","John_Smith") ,
year_of_birth=c(1995,1989,2002),unique_id=c("aby37","nf93","fnd34"))

I would like to merge the data frames such that for each person in data1 I have a unique_id. If possible I would also like to get the nearest integer match for those for whom there is no overlap. To clarify I can't just left_join on name because some entries have the same name but different dates of birth.

I'm relatively new to R so I don't even really know where to begin.

Much appreciated!

Can you use left_join when one df has a range and the other only has a single number? — sd3184, Nov 02 '21 at 00:21
What if `year_of_birth` is not in range of `max_year` and `min_year`? Then do you want to drop that row? — Park, Nov 02 '21 at 00:23
yea then it should look for another match. To be clear this is a very large data set and the names don't match up in sequential order. — sd3184, Nov 02 '21 at 00:25
Edited for clarity. I can't merge on names because some of the names are the same. I need to include year of birth to get an accurate match. — sd3184, Nov 02 '21 at 00:28
If you need to get the closest match for those that do not fit within one of the intervals, you really should provide sample data that actually needs that. As it stands, neither answer produces that behavior. — r2evans, Nov 02 '21 at 13:08
Further, joining on year alone (and not name) is going to cause a lot of problems, I suspect. — r2evans, Nov 02 '21 at 13:11

Brian Montgomery · Answer 1 · 2021-11-05T00:55:05.430

With data.table you can to non-equal joins:
(added some extra columns and a name with no matches to example)

library(data.table)
data1<-data.table(name=c("Jessica_Smith","John_Smith","John_Smith", "Jackson_Brown") ,
                  max_year=c(2000,1989,2005,2020), 
                  min_year=c(1990,1989,2001,1990),
                  extra_col1=c('a', 'b', 'c', 'd'))

data2<-data.table(name=c("Jessica_Smith","John_Smith","John_Smith") ,
                  year_of_birth=c(1995,1989,2002),
                  unique_id=c("aby37","nf93","fnd34"),
                  extra_col2=1:3)

data2[data1, .(name, year_of_birth = x.year_of_birth, unique_id,
               extra_col1, extra_col2), 
      on = .(name, year_of_birth >= min_year, year_of_birth <= max_year)]

            name year_of_birth unique_id extra_col1 extra_col2
1: Jessica_Smith          1995     aby37          a          1
2:    John_Smith          1989      nf93          b          2
3:    John_Smith          2002     fnd34          c          3
4: Jackson_Brown            NA      <NA>          d         NA

I think it's better to list all the column names you want to include in the result.
One difficulty with with the non-equal joins is the default renaming of "matching" columns. The columns match the non-equal criteria, but they are not the same. This is why I specified x.year_of_birth (you can use x. or i. to pick which data.table the column comes from in x[i]).

Here is the result with no columns selected (which normally gives you all columns in a join):

data2[data1,
      on = .(name, year_of_birth >= min_year, year_of_birth <= max_year)]

            name year_of_birth unique_id extra_col2 year_of_birth.1 extra_col1
1: Jessica_Smith          1990     aby37          1            2000          a
2:    John_Smith          1989      nf93          2            1989          b
3:    John_Smith          2001     fnd34          3            2005          c
4: Jackson_Brown          1990      <NA>         NA            2020          d

Notice that year_of_birth here is really min_year, and year_of_birth.1 is really max_year. So this is confusing and data2$year_of_birth is not included because it's considered a "matched" column that doesn't need to be kept.

great answer, just a minor comment: when you join data2 into data1, your column `year_of_birth` maintains the correct values. `data1[data2, .(name, year_of_birth, unique_id), on = .(name, max_year >= year_of_birth, min_year <= year_of_birth)]` — hi_everyone, Nov 04 '21 at 09:28
One thing I forgot to mention is that there are more columns in data1 and data2 than listed. Do I need to list all of those or can I retain all the columns with a simple command? — sd3184, Nov 04 '21 at 14:21

score 1 · Answer 2 · answered Nov 02 '21 at 00:37

You may try,

data1 %>% 
  left_join(data2, by = "name") %>%
  rowwise %>%
  mutate(year_key = min_year <= year_of_birth && year_of_birth <= max_year) %>%
  filter(year_key == TRUE) %>% 
  select(-year_key)

  name          max_year min_year year_of_birth unique_id
  <chr>            <dbl>    <dbl>         <dbl> <chr>    
1 Jessica_Smith     2000     1990          1995 aby37    
2 John_Smith        1989     1989          1989 nf93     
3 John_Smith        2005     2001          2002 fnd34

I want to join two data frames one with a year range and the other with a year

2 Answers2