How to match two data.frames with an inexact matching identifier (one identifier has to be in the range of the other)

Question

I have the following matching problem: I have two data.frames, one with an observation every month (per company ID), and one with an observation every quarter (per company ID; note that quarter means fiscal quarter; therefore 1Q = Jan, Feb, Mar is not necessarily correct and also, a fiscal quarter is not necessarily 3 month long).

For every month and company, I want to get the correct value of that quarter. Consequently, several months have the same value for one quarter. As an example see the code below:

monthlyData <- data.frame(ID = rep(c("A", "B"), each = 5),
                  Month = rep(1:5, times = 2),
                  MonValue = 1:10)
monthlyData
   ID Month MonValue
1   A     1        1
2   A     2        2
3   A     3        3
4   A     4        4
5   A     5        5
6   B     1        6
7   B     2        7
8   B     3        8
9   B     4        9
10  B     5       10

#Quarterly data, i.e. the value of every quarter has to be matched to several months in d1
#However, I want to match fiscal quarters, which means that one quarter is not necessarily 3 month long
qtrData <- data.frame(ID = rep(c("A", "B"), each = 2),
                  startMonth = c(1, 4, 1, 3),
                  endMonth   = c(3, 5, 2, 5),
                  QTRValue   = 1:4)
qtrData
  ID startMonth endMonth QTRValue
1  A          1        3        1
2  A          4        5        2
3  B          1        2        3
4  B          3        5        4

#Desired output
   ID Month MonValue QTRValue
1   A     1        1        1
2   A     2        2        1
3   A     3        3        1
4   A     4        4        2
5   A     5        5        2
6   B     1        6        3
7   B     2        7        3
8   B     3        8        4
9   B     4        9        4
10  B     5       10        4

Note: This question was posted on R-help months ago, but I didn't get any answer then and found a solution myself (see R-help). Now, however, I posted a question on stackoverflow where I have a question regarding the data.table where this problem was mentioned as well and there, Andrie asked me to post this question again because he apparently has a good solution for it (see Question on SO)

UPDATE: See Matthew Dowle's comment: how does the real data look?

This data is a more realistic one. I added a few rows, but the only main part that changed is column endMonth in qtrData. More precisely, the startMonth is not necessarily the endMonth of the previous quarter plus one month anymore. Therefore, using the roll option, I think that you need another line of code (if not, you get 20 rows back, but with Andrie's solution, which is the desired one, you get 17 rows back). Then there is no performance difference anymore, if I don't miss anything here.

monthlyData_new <- data.table(ID = rep(c("A", "B"), each = 10),
                  Month = rep(1:10, times = 2),
                  MonValue = 1:20)

qtrData_new <- data.table(ID = rep(c("A", "B"), each = 3),
                  startMonth = c(1, 4, 7, 1, 3, 8),
                  endMonth   = c(3, 5, 10, 2, 5, 10),
                  QTRValue   = 1:6)

setkey(qtrData_new, ID)
setkey(monthlyData_new, ID)

qtrData1 <- qtrData_new
setkey(qtrData1, ID, startMonth)
monthlyData1 <- monthlyData_new
setkey(monthlyData1, ID, Month)

withTable1 <- function(){
  xx <- qtrData1[monthlyData1, roll=TRUE]
  xx <- xx[startMonth <= endMonth]

}

withTable2 <- function(){
  yy <- monthlyData_new[qtrData_new][Month >= startMonth & Month <= endMonth]

}

benchmark(withTable1, withTable2, replications=1e6)
        test replications elapsed relative user.self sys.self user.child sys.child
1 withTable1      1000000   4.244 1.028599     4.232    0.008          0         0
2 withTable2      1000000   4.126 1.000000     4.096    0.028          0         0

I never said I had a *good* solution. You should judge that for yourself :-) — Andrie, Nov 04 '11 at 15:12
I'm honest with you, @Andrie, that's not a _good_ solution...it's an awesome one! — Christoph_J, Nov 04 '11 at 15:32
@Andrie Just on the benchmark (from Andrie), that repeats a very small operation 1 million times. All that's timing is tons of call overhead. You need to construct a _large_ table and compare the times of a _single_ run of each method (the lowest of 3 runs, usually). — Matt Dowle, Nov 05 '11 at 00:48
@MatthewDowle OK, thanks for looking at that. I will adjust my code accordingly. — Christoph_J, Nov 05 '11 at 08:55

Matt Dowle · Answer 1 · 2011-11-05T01:16:57.330

Try this :

mD = data.table(monthlyData, key="ID,Month")
qD = data.table(qtrData,key="ID,startMonth")
qD[mD,roll=TRUE]
      ID startMonth endMonth QTRValue MonValue
 [1,]  A          1        3        1        1
 [2,]  A          2        3        1        2
 [3,]  A          3        3        1        3
 [4,]  A          4        5        2        4
 [5,]  A          5        5        2        5
 [6,]  B          1        2        3        6
 [7,]  B          2        2        3        7
 [8,]  B          3        5        4        8
 [9,]  B          4        5        4        9
[10,]  B          5        5        4       10

That should be much faster.

EDIT : Answering the follow-up edit in question. One way is use NA to store where the missing months are. I find it easier to look at one time series column (irregular with gaps and NA), than two making a series of ranges.

> mD <- data.table(ID = rep(c("A", "B"), each = 10),
+                  Month = rep(1:10, times = 2),
+                  MonValue = 1:20,  key="ID,Month")
>                  
> qD <- data.table(ID = rep(c("A", "B"), each = 4),
+                   Month = c(1,4,6,7, 1,3,6,8),
+                   QtrValue = c(1,2,NA,3, 4,5,NA,6),
+                   key="ID,Month")
>                   
> mD
      ID Month MonValue
 [1,]  A     1        1
 [2,]  A     2        2
 [3,]  A     3        3
 [4,]  A     4        4
 [5,]  A     5        5
 [6,]  A     6        6
 [7,]  A     7        7
 [8,]  A     8        8
 [9,]  A     9        9
[10,]  A    10       10
[11,]  B     1       11
[12,]  B     2       12
[13,]  B     3       13
[14,]  B     4       14
[15,]  B     5       15
[16,]  B     6       16
[17,]  B     7       17
[18,]  B     8       18
[19,]  B     9       19
[20,]  B    10       20
> qD
     ID Month QtrValue
[1,]  A     1        1
[2,]  A     4        2
[3,]  A     6       NA     # missing for 1 month  (6)
[4,]  A     7        3
[5,]  B     1        4
[6,]  B     3        5
[7,]  B     6       NA     # missing for 2 months (6 and 7)
[8,]  B     8        6
> qD[mD,roll=TRUE]
      ID Month QtrValue MonValue
 [1,]  A     1        1        1
 [2,]  A     2        1        2
 [3,]  A     3        1        3
 [4,]  A     4        2        4
 [5,]  A     5        2        5
 [6,]  A     6       NA        6
 [7,]  A     7        3        7
 [8,]  A     8        3        8
 [9,]  A     9        3        9
[10,]  A    10        3       10
[11,]  B     1        4       11
[12,]  B     2        4       12
[13,]  B     3        5       13
[14,]  B     4        5       14
[15,]  B     5        5       15
[16,]  B     6       NA       16
[17,]  B     7       NA       17
[18,]  B     8        6       18
[19,]  B     9        6       19
[20,]  B    10        6       20
> qD[mD,roll=TRUE][!is.na(QtrValue)]
      ID Month QtrValue MonValue
 [1,]  A     1        1        1
 [2,]  A     2        1        2
 [3,]  A     3        1        3
 [4,]  A     4        2        4
 [5,]  A     5        2        5
 [6,]  A     7        3        7
 [7,]  A     8        3        8
 [8,]  A     9        3        9
 [9,]  A    10        3       10
[10,]  B     1        4       11
[11,]  B     2        4       12
[12,]  B     3        5       13
[13,]  B     4        5       14
[14,]  B     5        5       15
[15,]  B     8        6       18
[16,]  B     9        6       19
[17,]  B    10        6       20

Thanks @Matthew Dowle, that's a good example for the `role` argument. However, within my real data I can not necessarily guarantee that `startMonth` of this quarter equals `endMonth` plus 1 of the quarter before. Therefore, I would still need a line `qd[startMonth <= endMonth]`. Not sure if this would change something with respect to the performance of both approaches though. — Christoph_J, Nov 04 '11 at 17:41
@Christoph_J Gaps are fine. There isn't a +1 assumption. Try it. See `?data.table` for examples of `roll`. — Matt Dowle, Nov 04 '11 at 18:27
@Christoph_J Or to save confusion, please post a more realistic example data set and I'll see what I can do. — Matt Dowle, Nov 04 '11 at 18:36
+1 I still have no idea why this works, but it's pretty cool. — Andrie, Nov 04 '11 at 19:19
@Andrie See `?data.table`, the long paragraph for the `roll` argument. — Matt Dowle, Nov 04 '11 at 19:56
@MatthewDowle I updated my answer so that it is a better proxy for my data. But still good to know in which cases you can use `role`. It's definitely one of those options that I didn't get when I read the 10 minutes introduction a couple of times, but now it's definitely clear to me ;-) So thanks! — Christoph_J, Nov 04 '11 at 21:05
@Christoph_J Great thanks. Hope latest edit gives a few more ideas to throw into the mix. — Matt Dowle, Nov 05 '11 at 01:19

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

Here are two solutions, using Base R and data.table. Since the data.table solution is about 30% faster than base R, and also much easier to read, I recommend using data.table for this.

Base R

Since you expressed a wish to have this efficient, I use vapply:

matchData <- function(id, month, data=d2){
  vapply(seq_along(id), 
      function(i)which(
            id[i]==data$ID & 
                month[i] >= data$startMonth & 
                month[i] <= data$endMonth),
      FUN.VALUE=1,
      USE.NAMES=FALSE
      )
}


within(monthlyData, 
    Value <- qtrData$QTRValue[matchData(
               monthlyData$ID, monthlyData$Month, qtrData)]
)

   ID Month MonValue Value
1   A     1        1     1
2   A     2        2     1
3   A     3        3     1
4   A     4        4     2
5   A     5        5     2
6   B     1        6     3
7   B     2        7     3
8   B     3        8     4
9   B     4        9     4
10  B     5       10     4

data.table

And also demonstrating how to do this using data.table:

mD <- data.table(monthlyData, key="ID")
qD <- data.table(qtrData, key="ID")
mD[qD][Month>=startMonth & Month<=endMonth]


      ID Month MonValue startMonth endMonth QTRValue
 [1,]  A     1        1          1        3        1
 [2,]  A     2        2          1        3        1
 [3,]  A     3        3          1        3        1
 [4,]  A     4        4          4        5        2
 [5,]  A     5        5          4        5        2
 [6,]  B     1        6          1        2        3
 [7,]  B     2        7          1        2        3
 [8,]  B     3        8          3        5        4
 [9,]  B     4        9          3        5        4
[10,]  B     5       10          3        5        4

Benchmark

I was curious how these two approaches compare:

library(rbenchmark)

withBase <- function(){
  xx <- within(monthlyData, 
      Value <- qtrData$QTRValue[matchData(monthlyData$ID, monthlyData$Month, qtrData)])
  
}

withTable <- function(){
  yy <- mD[qD][Month>=startMonth & Month<=endMonth]
  
}

benchmark(withBase, withTable, replications=1e6)

       test replications elapsed relative user.self sys.self user.child
1  withBase      1000000   10.09 1.296915      7.65     0.21         NA
2 withTable      1000000    7.78 1.000000      6.38     0.16         NA

That's a very very slow approach in `data.table` ;) `roll=TRUE` is the ticket here. — Matt Dowle, Nov 04 '11 at 15:52

How to match two data.frames with an inexact matching identifier (one identifier has to be in the range of the other)

2 Answers2

Base R

data.table

Benchmark