
I wrote a nested for loop in R, but it is taking far too long to run. I have two big datasets. For every row in dfA and every row in dfB, the loop should check whether Date in dfA falls within the date interval in dfB. If it does, the two datasets should merge on a given column for that row. I'm not sure the code I wrote will even run without error, because the loop is still going.

Any insight would be appreciated.

dfA:


       Common    a       Date 
1 20141331123    1        2005-01-01 
2 20141331123    2        2005-01-02 
3 20141331123    3        2005-01-03 
4 20141331123    4        2005-01-04 
5 20141331123    5        2005-01-05 
6 20141331123    6        2005-01-06 

dfB:


  cDate       bDate      common                      
1 2005-01-01 2005-06-13  20141331123 

library(lubridate)

dfB$Interval <- interval(ymd(dfB$cDate), ymd(dfB$bDate))

for (i in 1:nrow(dfA)) {

  for (j in 1:nrow(dfB)) {

    if (dfA$Date[i] %within% dfB$Interval[j]) {

      merged <- merge(dfA, dfB, by.x = "Common", by.y = "common")

    }

  }

}

user12310746
  • (1) Your code is *discarding all results*. If you want to `merge`, then you need to capture that return value somewhere. Without that, you are doing all of the work with nothing to show for it. (2) If you're looking at a range-based join operation, then I *strongly* suggest doing a non-equi join and stop trying loops. `data.table` and SQL do it natively (sadly, base-R and `dplyr` do not). – r2evans Sep 04 '20 at 20:54
  • ... and where does `apply()` come in here? – r2evans Sep 04 '20 at 20:55
  • It would help immensely if you provide small, representative sample data using `dput`. That is, a handful of rows from each, where some rows merge and some rows do not. – r2evans Sep 04 '20 at 20:58
  • I added some data. I used `lubridate` because I want to see if Date in dfA falls within a date interval in dfB. I thought apply() would come into play to replace the for loops? My understanding could be incorrect. I'm still learning how to use R. – user12310746 Sep 04 '20 at 21:08
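Picking up the comment thread: lubridate's `%within%` is vectorized over dates, so the nested loop can be replaced by a single comparison. A minimal sketch, assuming dfB has a single row as in the sample data (data construction here mirrors the sample tables):

```r
# Vectorized alternative to the nested loop (a sketch; assumes dfB has
# exactly one row, as in the sample data above).
library(lubridate)

dfA <- data.frame(Common = rep("20141331123", 6), a = 1:6,
                  Date = as.Date("2005-01-01") + 0:5)
dfB <- data.frame(cDate = as.Date("2005-01-01"),
                  bDate = as.Date("2005-06-13"), common = "20141331123")

# One logical flag per row of dfA -- no loop needed.
keep <- dfA$Date %within% interval(ymd(dfB$cDate), ymd(dfB$bDate))

# Note the direction: "Common" is the dfA column, "common" the dfB column.
merged <- merge(dfA[keep, ], dfB, by.x = "Common", by.y = "common")
```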

2 Answers

4

Non-equi joins are supported natively in SQL, and within R by data.table. Neither base R nor the tidyverse supports them directly[1].

library(data.table)
setDT(dfA)
setDT(dfB)
dfB[dfA, on = .(common == Common, cDate <= Date, bDate >= Date)]
#         cDate      bDate      common a
# 1: 2005-01-01 2005-01-01 20141331123 1
# 2: 2005-01-02 2005-01-02 20141331123 2
# 3: 2005-01-03 2005-01-03 20141331123 3
# 4: 2005-01-04 2005-01-04 20141331123 4
# 5: 2005-01-05 2005-01-05 20141331123 5
# 6: 2005-01-06 2005-01-06 20141331123 6

The sample data is a little uninteresting in that everything fits in the single interval, but perhaps this will work with your more varied data.

[1]: Since SQL supports it, dbplyr supports it via `sql_on`.
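For completeness, a sketch of that dbplyr route: push both tables into a SQL backend and express the range condition with `sql_on`. This assumes the RSQLite and dbplyr packages are available; Date columns travel to SQLite as day counts, so the BETWEEN comparison happens on numbers.

```r
# Sketch of a non-equi join through dbplyr's sql_on (assumes RSQLite).
library(DBI)
library(dplyr)
library(dbplyr)

dfA <- data.frame(Common = rep("20141331123", 6), a = 1:6,
                  Date = as.Date("2005-01-01") + 0:5)
dfB <- data.frame(cDate = as.Date("2005-01-01"),
                  bDate = as.Date("2005-06-13"), common = "20141331123")

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "dfA", dfA)
dbWriteTable(con, "dfB", dfB)

# LHS/RHS are the aliases dbplyr assigns to the two tables in sql_on.
res <- inner_join(
  tbl(con, "dfA"), tbl(con, "dfB"),
  sql_on = "LHS.Common = RHS.common AND LHS.Date BETWEEN RHS.cDate AND RHS.bDate"
) %>% collect()

dbDisconnect(con)
```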


Data:

dfA <- structure(list(Common = c("20141331123", "20141331123", "20141331123", "20141331123", "20141331123", "20141331123"), a = 1:6, Date = structure(c(12784, 12785, 12786, 12787, 12788, 12789), class = "Date")), row.names = c(NA, -6L), class = "data.frame")
dfB <- structure(list(cDate = structure(12784, class = "Date"), bDate = structure(12947, class = "Date"), common = "20141331123"), row.names = c(NA, -1L), class = "data.frame")
r2evans
  • Thank you. This is helpful and super fast. The data merged correctly, but is there a way to do this merge while also including the Date column in dfA? – user12310746 Sep 09 '20 at 01:28
  • For now, what I've done is make a duplicate of the Date column so that it's included in the final datatable. – user12310746 Sep 09 '20 at 01:35
  • Yes, you tend to lose one of the three fields in a non-equi merge (a frustration I share). Copying the field is how I cope as well. – r2evans Sep 09 '20 at 01:49
  • Okay, got it. I'm really amazed with how fast this merge is. – user12310746 Sep 09 '20 at 01:52
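The workaround from this comment thread, as a sketch: copy the Date column before the non-equi join, so the original dates survive in the result (the join consumes `Date` into the `cDate`/`bDate` bound columns).

```r
# Keep dfA's Date after a non-equi join by duplicating it first (sketch).
library(data.table)

dfA <- data.table(Common = rep("20141331123", 6), a = 1:6,
                  Date = as.Date("2005-01-01") + 0:5)
dfB <- data.table(cDate = as.Date("2005-01-01"),
                  bDate = as.Date("2005-06-13"), common = "20141331123")

dfA[, Date2 := Date]  # Date itself is consumed by the join bounds; Date2 survives
res <- dfB[dfA, on = .(common == Common, cDate <= Date, bDate >= Date)]
```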
2

Consider a straightforward merge and subset if data size allows it.

final_df <- subset(merge(dfA, dfB, by.x="Common", by.y="common"),
                   Date >= cDate & Date <= bDate)

final_df
#        Common a       Date      cDate      bDate
# 1 20141331123 1 2005-01-01 2005-01-01 2005-06-13
# 2 20141331123 2 2005-01-02 2005-01-01 2005-06-13
# 3 20141331123 3 2005-01-03 2005-01-01 2005-06-13
# 4 20141331123 4 2005-01-04 2005-01-01 2005-06-13
# 5 20141331123 5 2005-01-05 2005-01-01 2005-06-13
# 6 20141331123 6 2005-01-06 2005-01-01 2005-06-13

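The same logic can also be written as the SQL-native range join mentioned in the comments, via the sqldf package (a sketch; Date columns reach SQLite as day counts, so convert them back with `as.Date(..., origin = "1970-01-01")` if the classes don't survive the round trip):

```r
# Range join expressed directly in SQL through sqldf (sketch).
library(sqldf)

dfA <- data.frame(Common = rep("20141331123", 6), a = 1:6,
                  Date = as.Date("2005-01-01") + 0:5)
dfB <- data.frame(cDate = as.Date("2005-01-01"),
                  bDate = as.Date("2005-06-13"), common = "20141331123")

res <- sqldf("SELECT a.*, b.cDate, b.bDate
              FROM dfA a
              JOIN dfB b
                ON a.Common = b.common
               AND a.Date BETWEEN b.cDate AND b.bDate")
```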

Parfait
  • This starts with a near-cartesian expansion (per-`common`), doesn't it? – r2evans Sep 04 '20 at 21:28
  • Yes, if the values in `common` are all the same in both datasets, as the sample suggests. I qualify that this solution works if data sizes allow it. – Parfait Sep 04 '20 at 21:30