Merge dataframes on matching A, B and closest C?

Question

I have two dataframes like so:

set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=F)

I want to merge to.merge into df (with all.x=T) such that:

df$x == to.merge$x AND
df$y == to.merge$y AND
abs(df$time - to.merge$time) <= 1; in the case of multiple to.merge that satisfy, we pick the one that minimises this distances.

How can I do this?

So my desired result is (this is just df with the corresponding value column of to.merge added for matching rows):

   x y time val
1  1 1    8  NA
2  2 1   11   c
3  3 1   17  NA
4  1 2   27  NA
5  2 2    6  NA
6  3 2   27  NA
7  1 3   28  NA
8  2 3   20  NA
9  3 3   19  NA
10 1 4    2  NA
11 2 4    6  NA
12 3 4    5  NA
13 1 5   21  NA
14 2 5   12  NA
15 3 5   23   d

where to.merge was:

  x y time val
1 2 1 17.0   a
2 2 1 12.0   b
3 2 1 11.6   c
4 3 5 22.5   d
5 2 4  2.0   e

Note - (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from df$time 11 for (X, Y) = (2, 1).

Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.

Finally, there may be rows in to.merge that do not match anything in df.

One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows)

df$value <- NA
for (i in 1:nrow(df)) {
    row <- df[i, ]
    idx <- which(row$x == to.merge$x &
                 row$y == to.merge$y &
                 abs(row$time - to.merge$time) <= 1)
    if (length(idx)) {
        j <- idx[which.min(row$time - to.merge$time[idx])]
        df$val[i] <- to.merge$val[j]
    }
}

I feel that I can somehow do a merge, like:

to.merge$closest_time_in_df <- sapply(to.merge$time,
                                  function (tm) {
                                     dts <- abs(tm - df$time)
                                     # difference must be at most 1
                                     if (min(dts) <= 1) {
                                         df$time[which.min(dts)]
                                     } else {
                                         NA
                                     }
                                  })
merge(df, to.merge,
      by.x=c('x', 'y', 'time'),
      by.y=c('x', 'y', 'closest_time_in_df'),
      all.x=T)

But this doesn't merge the (2, 1, 11) row because to.merge$closest_time_in_df for (2, 1, 11.5, c) is 12, but a time of 12 in df corresponds to (x, y) = (2, 5) not (2, 1) hence the merge fails.

mnel · Answer 1 · 2013-04-19T07:00:19.727

16

Use data.table and roll='nearest' or to limit to 1, roll = 1, rollends = c(TRUE,TRUE)

eg

library(data.table)
# create data.tables with the same key columns (x, y, time)
DT <- data.table(df, key = names(df))
tm <- data.table(to.merge, key = key(DT))

# use join syntax with roll = 'nearest'


tm[DT, roll='nearest']

#     x y time val
#  1: 1 1    8  NA
#  2: 1 2   27  NA
#  3: 1 3   28  NA
#  4: 1 4    2  NA
#  5: 1 5   21  NA
#  6: 2 1   11   c
#  7: 2 2    6  NA
#  8: 2 3   20  NA
#  9: 2 4    6   e
# 10: 2 5   12  NA
# 11: 3 1   17  NA
# 12: 3 2   27  NA
# 13: 3 3   19  NA
# 14: 3 4    5  NA
# 15: 3 5   23   d

You can limit your self to looking forward and back (1) by setting roll=-1 and rollends = c(TRUE,TRUE)

new <- tm[DT, roll=-1, rollends  =c(TRUE,TRUE)]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d

Or you can roll=1 first, then roll=-1, then combine the results (tidying up the val.1 column from the second rolling join)

new <- tm[DT, roll = 1][tm[DT,roll=-1]][is.na(val), val := ifelse(is.na(val.1),val,val.1)][,val.1 := NULL]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d

edited Apr 19 '13 at 07:00

answered Apr 19 '13 at 01:54

mnel

113,303
27
265
254

Is your input data different? Your output is not matching with OP's desired output. – CHP Apr 19 '13 at 02:22
Your input data appears different to mine. However I tried with my input data, and your solution still merges `DT`'s (2, 4, 6) row with `tm`'s (2, 4, 2) row, which it shouldn't, because the difference in times here is more than 1 (as stated in the question) – mathematical.coffee Apr 19 '13 at 02:30
@geektrader. Good catch. I hadn't run `set.seed(1)`. I've also now included the actual answer to his question (:blush:) – mnel Apr 19 '13 at 02:34
Thanks, slick answer! I accepted geektrader's purely because I don't have to use `data.table` for it (which confuses me mightily), but I may eventually transition to `data.table` so I'll keep this in mind! – mathematical.coffee Apr 19 '13 at 02:36
1

Just a comment - I've been playing with this a bit and have noticed that for my actual data, `roll=1` will only merge some rows, and `roll=-1` will merge some others, and only if I combine the two will I get all that I expect to be merged. (and I can't seem to reproduce it in my particular toy dataset for the question, so I'm not sure what causes it). Does this ring a bell with you as to why it would be happening? – mathematical.coffee Apr 19 '13 at 06:13
1

It does, I had a long convaluted approach rolling -1, then 1 and then replacing NA values appropriately, but thought the rollends solution was identical. hmmm – mnel Apr 19 '13 at 06:21
2

OK, I have worked out how to reproduce. `to.merge <- rbind(to.merge, list(x=2, y=1, time=9.5, val='f'))` and contrast rolling with 1 and -1. It occurs (say for `roll=1`) when you have an invalid value before (as 9.5 is < 11 but differs by more than 1) and a valid value after (the 11.6). `roll=1` seems to notice that the before row is invalid so it puts `NA`, not even checking the after row. Would you mind putting up your convoluted approach again? (the data.table solution is way way way faster than the `merge` one, though I have spent much longer trying to understand it) – mathematical.coffee Apr 19 '13 at 06:26
1

@mathematical.coffee -- i've added it, and replaced roll=1 with roll=-1 in the original solution (as =1 did not appear to work (as you noticed). `-1` seems more robust, but I don't know why! – mnel Apr 19 '13 at 07:01
1

@mnel cheers! (you can get the same problem rolling with -1 if you have a valid value before and an invalid after - I think the only fully robust solution is to roll with +1 and -1 and merge the NAs as you did) – mathematical.coffee Apr 19 '13 at 12:24
perhaps update/"modernize" this with the `on` argument? – MichaelChirico Jul 17 '16 at 15:21

score 6 · Accepted Answer · answered Apr 19 '13 at 02:26

Using merge couple of times and aggregate once, here is how to do it.

set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = F)

#Find rows that match by x and y
res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
res$dif <- abs(res$time.x - res$time.y)
res
##   x y time.x val time.y dif
## 1 2 1   17.0   a     11 6.0
## 2 2 1   12.0   b     11 1.0
## 3 2 1   11.6   c     11 0.6
## 4 2 4    2.0   e      6 4.0
## 5 3 5   22.5   d     23 0.5

#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
res1
##   x y dif time.x val time.y
## 1 2 1 0.6   11.6   c     11
## 2 2 4 4.0    2.0   e      6
## 3 3 5 0.5   22.5   d     23

#Finally merge the result back into df
final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
final
##    x y time  val
## 1  1 1    8 <NA>
## 2  1 2   27 <NA>
## 3  1 3   28 <NA>
## 4  1 4    2 <NA>
## 5  1 5   21 <NA>
## 6  2 1   11    c
## 7  2 2    6 <NA>
## 8  2 3   20 <NA>
## 9  2 4    6 <NA>
## 10 2 5   12 <NA>
## 11 3 1   17 <NA>
## 12 3 2   27 <NA>
## 13 3 3   19 <NA>
## 14 3 4    5 <NA>
## 15 3 5   23    d

Your row 9 shouldn't be there, because the time in `df` is 6 and the time in `to.merge` is 2, and these differ by more than 1 — mathematical.coffee, Apr 19 '13 at 02:28
thankyou! very clever using the multiple `merge`, and I've never used `aggregate` before. Also, the `all.x` is not necessary your first `merge` I believe. — mathematical.coffee, Apr 19 '13 at 02:35

score 1 · Answer 3 · answered Jul 19 '18 at 12:05

mnel's answer uses roll = "nearest" in a data.table join but does not limit to +/- 1 as requested by the OP. In addition, MichaelChirico has suggested to use the on parameter.

This approach uses

roll = "nearest",
an update by reference, i.e., without copying,
setDT() to coerce a data.frame to data.table without copying (introduced 2014-02-27 with v.1.9.2 of data.table),
the on parameter which spares to set a key explicitely (introduced 2015-09-19 with v.1.9.6).

So, the code below

library(data.table)   # version 1.11.4 used
setDT(df)[setDT(to.merge), on  = .(x, y, time), roll = "nearest",
          val := replace(val, abs(x.time - i.time) > 1, NA)]
df

has updated df:

    x y time  val
 1: 1 1    8 <NA>
 2: 2 1   11    c
 3: 3 1   17 <NA>
 4: 1 2   27 <NA>
 5: 2 2    6 <NA>
 6: 3 2   27 <NA>
 7: 1 3   28 <NA>
 8: 2 3   20 <NA>
 9: 3 3   19 <NA>
10: 1 4    2 <NA>
11: 2 4    6 <NA>
12: 3 4    5 <NA>
13: 1 5   21 <NA>
14: 2 5   12 <NA>
15: 3 5   23    d

Note that the order of rows has not been changed (in contrast to Chinmay Patil's answer)

In case df must not be changed, a new data.table can be created by

result <- setDT(to.merge)[setDT(df), on  = .(x, y, time), roll = "nearest",
                .(x, y, time, val = replace(val, abs(x.time - i.time) > 1, NA))]
result

which returns the same result as above.

Merge dataframes on matching A, B and closest C?

3 Answers3

Linked

Merge dataframes on matching A, B and *closest* C?

3 Answers3

Linked

Merge dataframes on matching A, B and closest C?