I have a record linkage problem with very large datasets (2,000 entries in the A-file, ~70,000,000 entries in the B-file) and want to do distance-based matching with the Jaro-Winkler algorithm in R. Both files are data.tables filled with strings.
To develop my methodology, I used subsamples and the "RecordLinkage" package. Its advantage is that I can apply blocking before the actual string comparison. The R commands for that are
compare.linkage(dataset1, dataset2, strcmp, blockfld)
RLBigDataLinkage(dataset1, dataset2, strcmp, blockfld)
The big disadvantage is that a comparison pattern is created for every pair of entries from the A-file and the B-file, which requires too much memory. Is there any way to do blocking and keep only the n record pairs with the best Jaro-Winkler scores for each entry in the A-file?
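To show what I mean by blocking, here is a minimal base-R sketch: split both files on a blocking key and only form candidate pairs within matching blocks. The first-character key is just an assumption for illustration; blockfld in RecordLinkage blocks on whole fields instead.

```r
# Minimal blocking sketch in base R. The blocking key (first character)
# is only an illustrative assumption; blockfld uses whole fields.
block_pairs <- function(a, b, key = function(x) substr(x, 1, 1)) {
  ka <- key(a); kb <- key(b)
  # only records sharing a key are paired, so disjoint blocks never get compared
  blocks <- lapply(intersect(ka, kb), function(k)
    expand.grid(a_idx = which(ka == k), b_idx = which(kb == k)))
  if (length(blocks) == 0)
    return(data.frame(a_idx = integer(0), b_idx = integer(0)))
  do.call(rbind, blocks)
}
```

For example, block_pairs(c("ab", "ac", "bd"), c("aa", "bb", "bc")) produces only 4 candidate pairs instead of the 9 a full cross join would create.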
To make things more understandable, I'll give you a short example (I didn't use blocking, for simplicity):
library(RecordLinkage)
a <- as.matrix(c("ab", "ac", "ad", "aa"))
b <- as.matrix(c("bb", "bc", "bd", "bb"))
test <- compare.linkage(a, b)
str(test$pairs)
nrow(test$pairs)
My problem is not that copies of "a" and "b" are included in "test", but the length of "test$pairs". In the above example, "test$pairs" stores the comparison scores for all possible combinations of records, so there are 4*4 = 16 entries. What I'd like is to store only the n combinations with the best comparison scores for each element in the A-file. With n = 2, I would get only 4 (records in the A-file) * 2 (best-scoring record pairs each) = 8 entries.
This difference might be small in the above example, but is crucial for big data sets.
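What I have in mind looks roughly like the following base-R sketch: process the B-file in chunks and keep, per A-record, only a running top-n, so memory stays at O(length(a) * n) instead of length(a) * length(b). The jaro_winkler() below is a plain reimplementation for illustration only; RecordLinkage ships a much faster C-backed jarowinkler().

```r
# Plain base-R Jaro-Winkler, for illustration only (RecordLinkage's
# jarowinkler() is the C-backed equivalent).
jaro_winkler <- function(s1, s2, p = 0.1) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  la <- length(a); lb <- length(b)
  if (la == 0 || lb == 0) return(as.numeric(la == lb))
  window <- max(floor(max(la, lb) / 2) - 1, 0)
  a_match <- logical(la); b_match <- logical(lb)
  for (i in seq_len(la)) {                 # find matching characters
    lo <- max(1, i - window); hi <- min(lb, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!b_match[j] && a[i] == b[j]) {
        a_match[i] <- TRUE; b_match[j] <- TRUE
        break
      }
    }
  }
  m <- sum(a_match)
  if (m == 0) return(0)
  t <- sum(a[a_match] != b[b_match]) / 2   # transpositions
  jaro <- (m / la + m / lb + (m - t) / m) / 3
  l <- 0                                   # common prefix, capped at 4
  for (k in seq_len(min(4, la, lb))) {
    if (a[k] == b[k]) l <- l + 1 else break
  }
  jaro + l * p * (1 - jaro)
}

# Chunked top-n matching: only the n best B-indices per A-record survive,
# so the full length(a) * length(b) comparison table is never materialised.
top_n_pairs <- function(a_vec, b_vec, n = 2, chunk_size = 1e6) {
  best <- lapply(seq_along(a_vec), function(i)
    data.frame(b_idx = integer(0), score = numeric(0)))
  for (s in seq(1, length(b_vec), by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1, length(b_vec))
    for (i in seq_along(a_vec)) {
      sc <- vapply(b_vec[idx], function(y) jaro_winkler(a_vec[i], y),
                   numeric(1), USE.NAMES = FALSE)
      cand <- rbind(best[[i]], data.frame(b_idx = idx, score = sc))
      best[[i]] <- head(cand[order(-cand$score), , drop = FALSE], n)
    }
  }
  best
}
```

On the toy example above, top_n_pairs(c("ab", "ac", "ad", "aa"), c("bb", "bc", "bd", "bb"), n = 2) returns 4 data frames of 2 rows each, i.e. the 8 pairs I want instead of all 16.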
Thank you in advance!