I would like to perform an operation on data.tables, that I can currently, successfully do with data.frames. Essentially, it is a merge function of two data.frames, that finds the closest match in df2 for df1 for one of many matching variables. This code is below.
I would like to do this in data.tables, because my data.frames are very large, and my current setup crashes if I try to complete this operation on the full data. Data.table, might allow me to do it outright on the full set, but if not, I find data.table easier to work with when using multiple subsets of data.
I am looking for the Id
(and its corresponding value
) from df2 that has the closest match to a State
s value
in df1 by the variables MM
and variable
(in this data.frame method, multiple pairings can occur if the there is a closest match tie (e.g. a value at both plus 1 and minus 1 exists)). When using data.frames I get the solution as final
below. I don't know how to set up data.table to give me the same result. I have tried variation of my keys, one example is below. There is an answer using data.table in the data.frame question I reference in the code, however, I can not get it to work with my example data.
# data.frame method
# used info from this thread: https://stackoverflow.com/questions/16095680
df1 <- structure(list(State = structure(c(1L, 1L, 3L, 3L, 2L, 2L, 1L,
1L, 1L), .Label = c("AK", "CO", "MS"), class = "factor"), MM = c(1L,
2L, 1L, 2L, 3L, 4L, 3L, 4L, 2L), variable = structure(c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("TMN", "TMX"), class = "factor"),
value = c(1L, 2L, 3L, 4L, 2L, 3L, 5L, 6L, 7L)), .Names = c("State",
"MM", "variable", "value"), class = "data.frame", row.names = c(NA,
-9L))
df2 <- structure(list(Id = c(1L, 2L, 3L, 1L, 2L, 3L, 5L, 6L, 7L, 5L,
6L, 7L, 8L), MM = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L,
4L, 5L), variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("TMN", "TMX"), class = "factor"),
value = c(1, 2, 3, 2, 3, 4, 2, 3, 5.5, 6.5, 3.5, 2.5, 8)), .Names = c("Id",
"MM", "variable", "value"), class = "data.frame", row.names = c(NA,
-13L))
#Find rows that match by x and y
res <- merge(df1, df2, by = c("MM", "variable"), all.x = TRUE)
res$dif <- abs(res$value.x - res$value.y)
#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ MM + variable, data = res, FUN = min), res)
#Finally merge the result back into df1
final <- merge(df1, res1[res1$dif <= 1, c("MM", "variable", "State", "Id", "value.y")], all.x = TRUE)
### one Data.table attempts
# create data.tables with the same key columns
keycols1 = c("MM", "variable", "value")
df1t <- data.table(df1, key = keycols1)
df2t <- data.table(df2, key = key(df1t))
setkey(df1t, value)
setkey(df2t, value)
test.final <- df2t[df1t, roll='nearest', allow.cartesian=TRUE]