3

I have used the call below to "join" my datasets based on an inequality condition:

library(sqldf)

sqldf("select *
from dataset1 a,
dataset2 b
a.col1 <= b.col2")

However, is there a way I can do this without sqldf?

So far, I can only see merge functions that are based on simple joins on a particular common column.

Thanks!

Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
anirana
  • 39
  • 1
  • 3

4 Answers4

5

Non-equi (or conditional) joins were recently implemented in data.table, and available in the current development version, v1.9.7. See installation instructions here.

require(data.table) # v1.9.7+
setDT(dataset1) # convert to data.tables
setDT(dataset2)
dataset1[dataset2, on=.(col1 < col2), nomatch=0L]

For each row of dataset2, find matching row indices while joining on condition provided to the on argument, and return all columns for those matching rows.

Arun
  • 116,683
  • 26
  • 284
  • 387
1

I've had that problem a few times and I think I got a solution using dplyr! It might not be the best in terms of efficiency, but it works. I'll suppose you have a constant variable in each case called 'dummy' (or alternatively, it can be another variable to join by). Moreover, I assume dataset1's columns are a_colj and those of dataset2 are b_colj:

dataset1 %>%
    inner_join(dataset2, by='dummy') %>%
    filter(a_col1 <= b_col2)
Felipe Gerard
  • 1,552
  • 13
  • 23
0

You could definitely do it in two steps utilizing merge.

Example (the exact details of the merge are up to you):

lessRows <- which(df1$col1 < df2$col2)
df3 <- merge(df1, df2)[lessRows, ]
devmacrile
  • 470
  • 2
  • 7
0

In dplyr, you can use join_by for inequality joins. Use closest to get only the closest match:

sales <- tibble(
  id = c(1L, 1L, 1L, 2L, 2L),
  sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01"))
)
sales
#      id sale_date 
# 1     1 2018-12-31
# 2     1 2019-01-02
# 3     1 2019-01-05
# 4     2 2019-01-04
# 5     2 2019-01-01

promos <- tibble(
  id = c(1L, 1L, 2L),
  promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02"))
)
#      id promo_date
# 1     1 2019-01-01
# 2     1 2019-01-05
# 3     2 2019-01-02

left_join(sales, promos, join_by(id, sale_date >= promo_date))
#      id sale_date  promo_date
# 1     1 2018-12-31 NA        
# 2     1 2019-01-02 2019-01-01
# 3     1 2019-01-05 2019-01-01
# 4     1 2019-01-05 2019-01-05
# 5     2 2019-01-04 2019-01-02
# 6     2 2019-01-01 NA        
Maël
  • 45,206
  • 3
  • 29
  • 67