Merge data.table by two variables and closest third

Question

I would like to perform an operation on data.tables, that I can currently, successfully do with data.frames. Essentially, it is a merge function of two data.frames, that finds the closest match in df2 for df1 for one of many matching variables. This code is below.

I would like to do this in data.tables, because my data.frames are very large, and my current setup crashes if I try to complete this operation on the full data. Data.table, might allow me to do it outright on the full set, but if not, I find data.table easier to work with when using multiple subsets of data.

I am looking for the Id (and its corresponding value) from df2 that has the closest match to a States value in df1 by the variables MM and variable (in this data.frame method, multiple pairings can occur if the there is a closest match tie (e.g. a value at both plus 1 and minus 1 exists)). When using data.frames I get the solution as final below. I don't know how to set up data.table to give me the same result. I have tried variation of my keys, one example is below. There is an answer using data.table in the data.frame question I reference in the code, however, I can not get it to work with my example data.

# data.frame method
# used info from this thread: https://stackoverflow.com/questions/16095680
df1 <- structure(list(State = structure(c(1L, 1L, 3L, 3L, 2L, 2L, 1L, 
1L, 1L), .Label = c("AK", "CO", "MS"), class = "factor"), MM = c(1L, 
2L, 1L, 2L, 3L, 4L, 3L, 4L, 2L), variable = structure(c(1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("TMN", "TMX"), class = "factor"), 
    value = c(1L, 2L, 3L, 4L, 2L, 3L, 5L, 6L, 7L)), .Names = c("State", 
"MM", "variable", "value"), class = "data.frame", row.names = c(NA, 
-9L))
df2 <- structure(list(Id = c(1L, 2L, 3L, 1L, 2L, 3L, 5L, 6L, 7L, 5L, 
6L, 7L, 8L), MM = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 
4L, 5L), variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L), .Label = c("TMN", "TMX"), class = "factor"), 
    value = c(1, 2, 3, 2, 3, 4, 2, 3, 5.5, 6.5, 3.5, 2.5, 8)), .Names = c("Id", 
"MM", "variable", "value"), class = "data.frame", row.names = c(NA, 
-13L))

#Find rows that match by x and y
res <- merge(df1, df2, by = c("MM", "variable"), all.x = TRUE)

res$dif <- abs(res$value.x - res$value.y)

#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ MM + variable, data = res, FUN = min), res)

#Finally merge the result back into df1
final <- merge(df1, res1[res1$dif <= 1, c("MM", "variable", "State", "Id", "value.y")], all.x = TRUE)

### one Data.table attempts
# create data.tables with the same key columns
keycols1 = c("MM", "variable", "value")
df1t <- data.table(df1, key = keycols1)
df2t <- data.table(df2, key = key(df1t))
setkey(df1t, value)
setkey(df2t, value)
test.final <- df2t[df1t, roll='nearest', allow.cartesian=TRUE]

The result in the data frame `final` in your example doesn't seem to match the description of what you are looking to get. For example, why does the combination (state=AK, variable=TMN, MM=1) produce two rows in `final`, should't it yield only one Id with the closest match? — ytsaig, Dec 09 '13 at 19:50
@YT Thanks, was missing `"State"` in the code for data.frame 'final' — nofunsally, Dec 09 '13 at 23:56

ytsaig · Accepted Answer · 2013-12-10T20:01:48.937

3

Not sure if this is the best way to achieve what you want, but here is one approach that is similar to what you do with data frames, only using data.tables instead:

dt1 <- data.table(df1)
dt2 <- data.table(df2)
res <- merge(dt1, dt2, by = c("MM", "variable"), all.x = TRUE, allow.cartesian=TRUE)
final_dt <- res[, .SD[abs(value.x - value.y) == min(abs(value.x - value.y))], by=c("State", "MM", "variable")]

Note that the result in final_dt differs from your result in final for (State=AK, MM=3, variable=TMX), where your approach above does not return a match even though according to your description a match should be returned.

edited Dec 10 '13 at 20:01

answered Dec 10 '13 at 07:16

ytsaig

3,267
3
23
27

`final_dt <- res[, .SD[abs(value.x - value.y) == min(abs(value.x - value.y))], by=c("State", "MM", "variable")]` Does this line translate as: take the subset of `res` where the value.x - value.y is the minimum? – nofunsally Dec 10 '13 at 17:43
Yes, if you add the by clause, i.e., it translates to: For each State,MM,variable combination, return the subset of rows in res where value.x - value.y is equal to the minimum value (the latter bit ensures that you can get multiple hits for each by group if there is more than one difference equal to the minimum). – ytsaig Dec 10 '13 at 19:46

Merge data.table by two variables and closest third

1 Answers1