0

I have two data frames that look like this

data1<-data.frame(name=c("Jessica_Smith","John_Smith","John_Smith") ,
max_year=c(2000,1989,2005), 
min_year=c(1990,1989,2001))

data2<-data.frame(name=c("Jessica_Smith","John_Smith","John_Smith") ,
year_of_birth=c(1995,1989,2002),unique_id=c("aby37","nf93","fnd34"))

I would like to merge the data frames such that for each person in data1 I have a unique_id. If possible I would also like to get the nearest integer match for those for whom there is no overlap. To clarify I can't just left_join on name because some entries have the same name but different dates of birth.

I'm relatively new to R so I don't even really know where to begin.

Much appreciated!

sd3184
  • 69
  • 4
  • why not just `left_join` – GuedesBF Nov 02 '21 at 00:20
  • Can you use left_join when one df has a range and the other only has a single number? – sd3184 Nov 02 '21 at 00:21
  • What if `year_of_birth` is not in range of `max_year` and `min_year`? Then do you want to drop that row? – Park Nov 02 '21 at 00:23
  • does `data1 %>% dplyr::left_join(data2, by="name")` work? – GuedesBF Nov 02 '21 at 00:24
  • yea then it should look for another match. To be clear this is a very large data set and the names don't match up in sequential order. – sd3184 Nov 02 '21 at 00:25
  • Edited for clarity. I can't merge on names because some of the names are the same. I need to include year of birth to get an accurate match. – sd3184 Nov 02 '21 at 00:28
  • 1
    If you need to get the closest match for those that do not fit within one of the intervals, you really should provide sample data that actually needs that. As it stands, neither answer produces that behavior. – r2evans Nov 02 '21 at 13:08
  • 1
    Further, joining on year alone (and not name) is going to cause a lot of problems, I suspect. – r2evans Nov 02 '21 at 13:11

2 Answers2

3

With data.table you can to non-equal joins:
(added some extra columns and a name with no matches to example)

library(data.table)
data1<-data.table(name=c("Jessica_Smith","John_Smith","John_Smith", "Jackson_Brown") ,
                  max_year=c(2000,1989,2005,2020), 
                  min_year=c(1990,1989,2001,1990),
                  extra_col1=c('a', 'b', 'c', 'd'))

data2<-data.table(name=c("Jessica_Smith","John_Smith","John_Smith") ,
                  year_of_birth=c(1995,1989,2002),
                  unique_id=c("aby37","nf93","fnd34"),
                  extra_col2=1:3)

data2[data1, .(name, year_of_birth = x.year_of_birth, unique_id,
               extra_col1, extra_col2), 
      on = .(name, year_of_birth >= min_year, year_of_birth <= max_year)]

            name year_of_birth unique_id extra_col1 extra_col2
1: Jessica_Smith          1995     aby37          a          1
2:    John_Smith          1989      nf93          b          2
3:    John_Smith          2002     fnd34          c          3
4: Jackson_Brown            NA      <NA>          d         NA

I think it's better to list all the column names you want to include in the result.
One difficulty with with the non-equal joins is the default renaming of "matching" columns. The columns match the non-equal criteria, but they are not the same. This is why I specified x.year_of_birth (you can use x. or i. to pick which data.table the column comes from in x[i]).

Here is the result with no columns selected (which normally gives you all columns in a join):

data2[data1,
      on = .(name, year_of_birth >= min_year, year_of_birth <= max_year)]

            name year_of_birth unique_id extra_col2 year_of_birth.1 extra_col1
1: Jessica_Smith          1990     aby37          1            2000          a
2:    John_Smith          1989      nf93          2            1989          b
3:    John_Smith          2001     fnd34          3            2005          c
4: Jackson_Brown          1990      <NA>         NA            2020          d

Notice that year_of_birth here is really min_year, and year_of_birth.1 is really max_year. So this is confusing and data2$year_of_birth is not included because it's considered a "matched" column that doesn't need to be kept.

Brian Montgomery
  • 2,244
  • 5
  • 15
  • great answer, just a minor comment: when you join data2 into data1, your column `year_of_birth` maintains the correct values. `data1[data2, .(name, year_of_birth, unique_id), on = .(name, max_year >= year_of_birth, min_year <= year_of_birth)]` – hi_everyone Nov 04 '21 at 09:28
  • One thing I forgot to mention is that there are more columns in data1 and data2 than listed. Do I need to list all of those or can I retain all the columns with a simple command? – sd3184 Nov 04 '21 at 14:21
1

You may try,

data1 %>% 
  left_join(data2, by = "name") %>%
  rowwise %>%
  mutate(year_key = min_year <= year_of_birth && year_of_birth <= max_year) %>%
  filter(year_key == TRUE) %>% 
  select(-year_key)

  name          max_year min_year year_of_birth unique_id
  <chr>            <dbl>    <dbl>         <dbl> <chr>    
1 Jessica_Smith     2000     1990          1995 aby37    
2 John_Smith        1989     1989          1989 nf93     
3 John_Smith        2005     2001          2002 fnd34  
Park
  • 14,771
  • 6
  • 10
  • 29