2

I have two datasets that are formatted like this:

df1
#>           Artist          Album Year
#> 1        Beatles  Sgt. Pepper's 1967
#> 2 Rolling Stones Sticky Fingers 1971

and

df2
#>            Album Year      Producer
#> 1  Sgt. Pepper's 1966 George Martin
#> 2 Sticky Fingers 1971  Jimmy Miller

I'd like to do an inner_join by Album and Year, but sometimes the 'Year' field is off by one year: for example, Sgt. Peppers is listed as 1967 in df1 by 1966 in df2.

So if I run:

df3 <- inner_join(df1, df2, by = c("Album", "Year"))

I get:

df3
#>           Artist          Album Year     Producer
#> 1 Rolling Stones Sticky Fingers 1971 Jimmy Miller

Whereas, I would like both albums to join, so long as something like (df1$Year == df2$Year + 1)|(df1$Year == df2$Year - 1).

I cannot just simple join by 'Album' because in my real data set there are some identically titled 'Albums' that are distinguished by 'Year'.

Code for the datasets below:

df1 <- data.frame(stringsAsFactors=FALSE,
      Artist = c("Beatles", "Rolling Stones"),
       Album = c("Sgt. Pepper's", "Sticky Fingers"),
        Year = c(1967, 1971)
)
df1

df2 <- data.frame(stringsAsFactors=FALSE,
       Album = c("Sgt. Pepper's", "Sticky Fingers"),
        Year = c(1966, 1971),
    Producer = c("George Martin", "Jimmy Miller")
)
df2
Jeremy K.
  • 1,710
  • 14
  • 35

5 Answers5

2

We could try using the sqldf package here, since your requirement can easily be phrased using a SQL join:

library(sqldf)

sql <- "SELECT t1.Artist, t1.Album, t1.Year, t2.Album, t2.Year, t2.Producer
        FROM df1 t1
        INNER JOIN df2 t2
            ON ABS(t1.Year - t2.Year) <= 1"
df3 <- sqldf(sql)

If you want to select all fields from both tables, then use:

SELECT t1.*, t2.* FROM ...

But note that in general SELECT * is frowned upon, and it is desirable to always list out the columns to select.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • If I want to `SELECT` all fields from both t1 and t2, is there a simple wildcard for the first line of code? Thank you. – Jeremy K. Jun 15 '19 at 04:11
2

Maybe a rolling join would take care of this. It works for your data sample, but it's possible there might be tricky edge cases in your actual data.

In the code below, roll="nearest" will match the nearest year value for each album (the "rolling" part applies only to the last join column, Year in this case).

library(data.table)

setDT(df1)
setDT(df2)

setkey(df1, Album, Year)
setkey(df2, Album, Year)

joined = df1[df2, roll="nearest"]

joined
           Artist          Album Year      Producer
1:        Beatles  Sgt. Pepper's 1966 George Martin
2: Rolling Stones Sticky Fingers 1971  Jimmy Miller
eipi10
  • 91,525
  • 24
  • 209
  • 285
1

Add Year + 1 to df2 and then join? You can also add Year - 1 if you want to cover range in both direction.

library(dplyr)

inner_join(df1, df2 %>%  bind_rows(df2 %>%  mutate(Year = Year + 1)),
                by = c("Album", "Year"))

#          Artist          Album Year      Producer
#1        Beatles  Sgt. Pepper's 1967 George Martin
#2 Rolling Stones Sticky Fingers 1971  Jimmy Miller
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Can I check my understanding of your solution, please? Am I right in thinking that you're using `bind_rows` to essentially copy and add on a duplicate of df2 with all of the years increased, so then df2 is twice the size, and more likely to join? – Jeremy K. Jun 15 '19 at 04:09
  • 1
    Yes, exactly. It's just a copy of df2 with years value increased by 1. – Ronak Shah Jun 15 '19 at 04:34
  • Is there a way I can cover the other direction in the same code? I tried `inner_join(df1, df2 %>% bind_rows(df2 %>% mutate(Year = Year + 1) %>% bind_rows(df2 %>% mutate(Year = Year - 1)), by = c("Album", "Year")))` but I must be doing something wrong. Thank you – Jeremy K. Jun 15 '19 at 06:18
  • 1
    @RAndStata yes, do `inner_join(df1, bind_rows(df2, df2 %>% mutate(Year = Year + 1), df2 %>% mutate(Year = Year - 1)), by = c("Album", "Year"))` – Ronak Shah Jun 15 '19 at 13:37
1

For the sake of completeness, this can also be solved using data.table's non-equi joins:

library(data.table)
setDT(df1)[, c(.SD, .(ym1 = Year - 1, yp1 = Year + 1))][
  setDT(df2), on = .(Album, ym1 <= Year, yp1 >= Year), nomatch = 0L]
           Artist          Album Year  ym1  yp1      Producer
1:        Beatles  Sgt. Pepper's 1967 1966 1966 George Martin
2: Rolling Stones Sticky Fingers 1971 1971 1971  Jimmy Miller

or

setDT(df1)[, c("ym1", "yp1") := .(Year - 1, Year + 1)][setDT(df2), 
           on = .(Album, ym1 <= Year, yp1 >= Year), nomatch = 0L]
           Artist          Album Year  ym1  yp1      Producer
1:        Beatles  Sgt. Pepper's 1967 1966 1966 George Martin
2: Rolling Stones Sticky Fingers 1971 1971 1971  Jimmy Miller

which modifies df1.


BTW: There is a feature request https://github.com/Rdatatable/data.table/issues/1639 to allow for on-the-fly columns in on. If this is implemented, above expression would become

setDT(df1)[setDT(df2), on = .(Album, Year - 1 <= Year, Year + 1 >= Year), nomatch = 0L]
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

If anyone is reading this question in the future, the above answers are great. Another answer is to:

  1. Join all of the matching Albums
  2. Filter out only the records where the years are close:

https://stackoverflow.com/a/55863846/8742237

inner_join(df1, df2, by = c("Album")) %>% 
  filter(abs(Year.x - Year.y)<2)

>           Artist          Album Year.x Year.y      Producer
> 1        Beatles  Sgt. Pepper's   1967   1966 George Martin
> 2 Rolling Stones Sticky Fingers   1971   1971  Jimmy Miller
Jeremy K.
  • 1,710
  • 14
  • 35