I have two dataframes like so:
set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=F)
I want to merge to.merge into df (with all.x=T) such that:
df$x == to.merge$x AND
df$y == to.merge$y AND
abs(df$time - to.merge$time) <= 1;
in the case of multiple to.merge rows that satisfy this, we pick the one that minimises this distance.
How can I do this?
So my desired result is (this is just df with the corresponding val column of to.merge added for matching rows):
   x y time val
1  1 1    8  NA
2  2 1   11   c
3  3 1   17  NA
4  1 2   27  NA
5  2 2    6  NA
6  3 2   27  NA
7  1 3   28  NA
8  2 3   20  NA
9  3 3   19  NA
10 1 4    2  NA
11 2 4    6  NA
12 3 4    5  NA
13 1 5   21  NA
14 2 5   12  NA
15 3 5   23   d
where to.merge was:
  x y time val
1 2 1 17.0   a
2 2 1 12.0   b
3 2 1 11.6   c
4 3 5 22.5   d
5 2 4  2.0   e
Note - (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from df$time 11 for (x, y) = (2, 1).
Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.
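To make the rule concrete, here is a small sketch of the matching for just that one df row. The names cand, ok and picked are only introduced for this illustration, and the df row's values (2, 1, 11) are hard-coded rather than looked up.

```r
# Candidate rows of to.merge for df's (x, y, time) = (2, 1, 11) row
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=FALSE)

# Same (x, y) as the df row: rows 'a', 'b' and 'c'
cand <- to.merge[to.merge$x == 2 & to.merge$y == 1, ]

# Distance in time to the df row: roughly 6 (a), 1 (b), 0.6 (c)
cand$dist <- abs(cand$time - 11)

# Only candidates within 1 survive: 'a' is dropped, 'b' and 'c' remain
ok <- cand[cand$dist <= 1, ]

# Of those, take the closest one
picked <- ok$val[which.min(ok$dist)]
```

Here picked ends up as 'c', which is what I expect in the val column for that row.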
Finally, there may be rows in to.merge that do not match anything in df.
One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows):
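df$val <- NA
for (i in seq_len(nrow(df))) {
    row <- df[i, ]
    # rows of to.merge with the same (x, y) and a time within 1
    idx <- which(row$x == to.merge$x &
                 row$y == to.merge$y &
                 abs(row$time - to.merge$time) <= 1)
    if (length(idx)) {
        # of those, pick the one with the smallest absolute time difference
        j <- idx[which.min(abs(row$time - to.merge$time[idx]))]
        df$val[i] <- to.merge$val[j]
    }
}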
I feel that I can somehow do a merge, like:
to.merge$closest_time_in_df <- sapply(to.merge$time,
    function (tm) {
        dts <- abs(tm - df$time)
        # difference must be at most 1
        if (min(dts) <= 1) {
            df$time[which.min(dts)]
        } else {
            NA
        }
    })
merge(df, to.merge,
      by.x=c('x', 'y', 'time'),
      by.y=c('x', 'y', 'closest_time_in_df'),
      all.x=T)
But this doesn't merge the (2, 1, 11) row, because to.merge$closest_time_in_df for (2, 1, 11.6, c) is 12, and a time of 12 in df corresponds to (x, y) = (2, 5), not (2, 1), hence the merge fails.
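One direction that might avoid this problem is to join on (x, y) only and apply the time condition afterwards, so the closest time is always looked up within the same (x, y) group. Below is a rough base R sketch of that idea on the toy data, not a tested solution for the full data. The helper columns row_id and dist are made up for this illustration, and it assumes the intermediate (x, y) join is small enough to fit in memory, which may not hold if many rows share the same (x, y).

```r
set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=FALSE)

# Hypothetical helper column so each df row can be identified after the join
df$row_id <- seq_len(nrow(df))

# Join on (x, y) only; the two 'time' columns become time_df and time_tm
cand <- merge(df, to.merge, by=c('x', 'y'), suffixes=c('_df', '_tm'))

# Keep only candidate pairs whose times are within 1 of each other
cand$dist <- abs(cand$time_df - cand$time_tm)
cand <- cand[cand$dist <= 1, ]

# For each df row keep the closest candidate: sort by distance within
# row_id, then drop everything after the first occurrence of each row_id
cand <- cand[order(cand$row_id, cand$dist), ]
cand <- cand[!duplicated(cand$row_id), ]

# Attach val back to df; df rows without a candidate get NA (like all.x=T)
df$val <- cand$val[match(df$row_id, cand$row_id)]
df$row_id <- NULL
```

On the example this gives 'c' for the (2, 1, 11) row and 'd' for the (3, 5, 23) row, with NA everywhere else, but I don't know how well it scales to ~12k by ~250k rows.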