I would like to accomplish the following using ffdf: merge on columns X and Y and the closest Time, and then merge on the closest value in column B. However, the procedure that I know for smaller samples relies on outer merges (as shown below). What is a way around this, using ffbase, for a large sample that won't fit in memory (and probably wouldn't work with sqldf)? If this is not possible, what would be the best library for it?
A reproducible example, using the same data as the data.table example further down:
library(ff)
library(ffbase)
set.seed(1)
df.ff <- as.ffdf(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
## ff cannot store character vectors, so val is kept as a factor here
to.merge.ff <- as.ffdf(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = TRUE))
I borrow the following example from @ChinmayPatil (R - merge dataframes on matching A, B and *closest* C?) to illustrate the procedure I would like to follow:
require(data.table)
set.seed(1)
df <- setDT(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge <- setDT(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = FALSE))
## First, do a left outer merge on x and y
A <- merge(to.merge, df, by = c('x', 'y'), all.x = TRUE)
## Then compute the absolute time difference for each row
A$diff <- abs(A$time.x - A$time.y)
## Finally, keep the row with the minimum difference within each (x, y) group
A[, .SD[which.min(diff)], by = c('x', 'y')]
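For comparison, the same nearest-time match can probably be written without the intermediate outer merge by using data.table's rolling join (`roll = "nearest"`). This is only a sketch on the in-memory example data, not an ffdf solution: the column name `time.merge` is introduced here purely for illustration, and whether the join could be applied chunk by chunk to ffdf data is an open assumption.

```r
library(data.table)

set.seed(1)
df <- as.data.table(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge <- data.table(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                       time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5])

## Keep a copy of to.merge's own time (hypothetical column name), because the
## join column `time` in the result takes its values from df
to.merge[, time.merge := time]

## For each row of df, match exactly on x and y and roll to the nearest time
## in to.merge; nomatch = NULL drops df rows whose (x, y) has no partner
res <- to.merge[df, on = .(x, y, time), roll = "nearest", nomatch = NULL]
res
```

Because the roll is applied to the last column listed in `on`, only one candidate row per df row is kept, so the full set of (x, y) pairings is never materialized.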