
I have a record linkage problem with very large datasets (2,000 entries in the A-file, ~70,000,000 entries in the B-file) and want to do distance-based matching with the Jaro-Winkler algorithm in R. Both files are data.tables filled with strings.

To develop my methodology I used subsamples and the package "RecordLinkage". The advantage of the package is that I can apply blocking before the actual string comparison. The R commands for that are

compare.linkage(dataset1, dataset2, strcmp, blockfld) 
RLBigDataLinkage(dataset1, dataset2, strcmp, blockfld)
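
To illustrate how blocking cuts down the candidate pairs, here is a small self-contained sketch (the two-column toy datasets and the choice of blocking field are my own illustration, not from the original question):

```r
library(RecordLinkage)  # assumption: the package is installed

# Two small two-column datasets: block on "zip", compare "name"
# with the Jaro-Winkler comparator.
a <- data.frame(zip = c("10", "10", "20"), name = c("meier", "mayer", "muller"))
b <- data.frame(zip = c("10", "20", "30"), name = c("meyer", "mueller", "maier"))

pairs <- compare.linkage(a, b,
                         blockfld  = 1,           # keep only pairs agreeing on "zip"
                         strcmp    = TRUE,
                         strcmpfun = jarowinkler,
                         exclude   = 1)           # don't score the blocking field itself
pairs$pairs  # pairs agreeing on zip: 2*1 + 1*1 = 3, instead of all 3*3 = 9
```

Blocking only restricts *which* pairs are generated; within a block, every A-record is still paired with every B-record, which is what causes the memory problem below.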

The big disadvantage of this is that a comparison field is created for every entry in the A-file crossed with every entry in the B-file, which requires too much memory. Is there any way to do blocking and keep only the n record pairs with the best Jaro-Winkler scores for each entry in the A-file?

To make things more understandable I'll give you a short example (I didn't use blocking, for simplicity):

library(RecordLinkage)
a <- as.matrix(c("ab", "ac", "ad", "aa"))
b <- as.matrix(c("bb", "bc", "bd", "bb"))
test <- compare.linkage(a, b)
str(test$pairs)
nrow(test$pairs)

My problem is not that copies of "a" and "b" are included in "test", but the length of "test$pairs". In the above example "test$pairs" stores the comparison scores for all possible combinations of records, so there are 4*4 = 16 entries in "test$pairs". What I'd like to do is store only the n combinations with the best comparison scores for each element in the A-file. So when I set n = 2, I only get 4 (records from the A-file) * 2 (record pairs with the best comparison scores) = 8 entries.

This difference may seem small in the above example, but it is crucial for big datasets.
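
To show what I mean by "keep only the n best pairs", here is a sketch of that behaviour outside of `RecordLinkage` (this is not the package's own API; it assumes the `stringdist` package and reuses the tiny vectors from above):

```r
library(stringdist)  # assumption: the stringdist package is installed

a <- c("ab", "ac", "ad", "aa")
b <- c("bb", "bc", "bd", "bb")
n <- 2

# Jaro-Winkler *distance* matrix (1 - similarity); rows = a, columns = b
d <- stringdistmatrix(a, b, method = "jw", p = 0.1)

# For each record in a, keep only the n best (lowest-distance) records in b
best <- lapply(seq_along(a), function(i) {
  j <- order(d[i, ])[seq_len(n)]
  data.frame(a = a[i], b = b[j], dist = d[i, j])
})
result <- do.call(rbind, best)
nrow(result)  # 4 * 2 = 8 pairs instead of 16
```

For the real ~70,000,000-row B-file one would process `b` in chunks and keep a running top-n per A-record, so the full distance matrix never has to exist in memory at once.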

Thank you in advance!

C Krüger
  • If I understand correctly, your problem is that `RecordLinkage` stores a copy of your data set in the output object of `RLBigDataLinkage` and because of the size of your data sets this does not fit into memory. Right? On what type of variable are you blocking? And are you comparing only one column? Please show an example of your data (or some code generating an example). – Jan van der Laan Mar 04 '14 at 10:23
  • Thanks for your answer; apparently I haven't described my problem well enough, so I added an example to make things clear. – C Krüger Mar 07 '14 at 12:54
  • When using `RLBigDataLinkage` the pairs are stored as an `ffdf`, so there should be no memory issues. However, there could still be issues with size, as the maximum length of an `ff` object is 2^31. So I don't think the size of `test$pairs` is really the problem. However, in your case it would still be better (performance-wise) to only generate the `n` best matching pairs. I don't think this is possible with `RecordLinkage`. I'll think about it. – Jan van der Laan Mar 07 '14 at 13:53
  • A quick look at the source for `compare.linkage` reveals that, with all due respect to the authors, it is a total mess and probably needs a complete rewrite. So there is definite scope for memory optimisation, but it may take a bit of effort to do it properly. – Richie Cotton Mar 07 '14 at 14:07
  • Thank you for your comments. I will focus on alternative programmes/packages for now, e.g. the MergeToolBox (MTB), written in Java. The MTB might be slower than the R package, but has no memory issues at all. – C Krüger Mar 10 '14 at 09:12
  • I've run both `RLBigDataLinkage` and `compare.linkage` to compare their performance. Even though `ff` objects are more memory-efficient, `RLBigDataLinkage` required around 6% more memory for the calculation than `compare.linkage`. – C Krüger Mar 10 '14 at 09:18

0 Answers