We are using RLBigDataLinkage from the RecordLinkage package in R to link two data sets: 1. master data (~1.6 million records) and 2. target data (~100k records).
The columns are first name, last name, address, zip, unique id 1 and unique id 2.
The unique ids are not available for all records in either data set, but when they are available they should be given the highest importance.
We are using fsWeights to supply the m and u probabilities and the cutoffs, so that each agreement pattern gets a fixed weight:
    compare_fs <- fsWeights(
      compare_lm,
      m = c(0.99, 0.99, 0.8, 0.6, 0.9, 0.6, 0.99, 0.99, 0.85),
      u = c(0.000001, 0.000008, 0.00003, 0.00003, 0.8, 0.009, 0.000001, 0.000001, 0.000003),
      cutoff = c(1, 0.95, 0.95, 0.9, 1, 0.98, 0.99, 0.99, 1)
    )
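For reference, this is how we understand a fixed pattern weight to arise. It is a minimal sketch of the textbook Fellegi-Sunter calculation, not the package's internal code: binarize and pattern_weight are hypothetical helpers, the similarity vectors are made up, and treating a missing comparison (NA) as disagreement is an assumption on our side.

```r
# Sketch of a Fellegi-Sunter pattern weight; not the package's internal code.
# m, u and cutoff are the vectors from our fsWeights call.
m <- c(0.99, 0.99, 0.8, 0.6, 0.9, 0.6, 0.99, 0.99, 0.85)
u <- c(0.000001, 0.000008, 0.00003, 0.00003, 0.8, 0.009,
       0.000001, 0.000001, 0.000003)
cutoff <- c(1, 0.95, 0.95, 0.9, 1, 0.98, 0.99, 0.99, 1)

# Hypothetical helper: turn string similarities into a 0/1 agreement pattern.
# Assumption: a missing comparison (NA) counts as disagreement.
binarize <- function(sim, cutoff) {
  agree <- as.integer(sim >= cutoff)
  agree[is.na(agree)] <- 0L
  agree
}

# Hypothetical helper: log2 likelihood ratio summed over all fields.
pattern_weight <- function(agree, m, u) {
  sum(ifelse(agree == 1, log2(m / u), log2((1 - m) / (1 - u))))
}

# Made-up similarity vectors, for illustration only.
sim_a <- c(1, 1, 0.96, 1, 1, NA, NA, NA, 1)
sim_b <- c(1, 1, 0.94, 1, 1, NA, NA, NA, 1)  # third field just below its cutoff

w_a <- pattern_weight(binarize(sim_a, cutoff), m, u)
w_b <- pattern_weight(binarize(sim_b, cutoff), m, u)
print(c(w_a = w_a, w_b = w_b, difference = w_a - w_b))
```

Under these assumptions, two pairs that look alike to the eye still get different weights as soon as one string similarity falls on the other side of its cutoff, because the whole field flips from agreement to disagreement.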
We use string comparison on all our columns and block on the first 3 characters of the first name rather than the full first name, so that we do not miss pairs with spelling mistakes in the first name:
    compare_lm <- RLBigDataLinkage(
      master_lm, target_lm,
      blockfld = c("FIRST_NAME_3"),
      strcmp = c("FIRST_NAME", "LAST_NAME", "ADDRESS1", "ZIP_OR_POSTAL_CODE", "UNIQUEID_1", "UNIQUEID_2"),
      strcmpfun = "jarowinkler",
      exclude = c("ID")
    )
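To make the blocking step concrete, here is a small base R sketch of how a 3-character key restricts the candidate pairs. The two data frames and the make_block_key helper are made up for illustration; only the column names FIRST_NAME and FIRST_NAME_3 come from our call.

```r
# Sketch only: derive a 3-character blocking key from the first name.
# make_block_key is a hypothetical helper, not part of RecordLinkage.
make_block_key <- function(first_name) {
  toupper(substr(trimws(first_name), 1, 3))
}

# Made-up demo data.
master_demo <- data.frame(FIRST_NAME = c("JOHN", "Jon", "AMMANARI"),
                          stringsAsFactors = FALSE)
target_demo <- data.frame(FIRST_NAME = c("JOHN", "JOHNNY", "AMMANARIA"),
                          stringsAsFactors = FALSE)

master_demo$FIRST_NAME_3 <- make_block_key(master_demo$FIRST_NAME)
target_demo$FIRST_NAME_3 <- make_block_key(target_demo$FIRST_NAME)

# Only records that share a key become candidate pairs.
candidates <- merge(master_demo, target_demo, by = "FIRST_NAME_3",
                    suffixes = c("_master", "_target"))
print(candidates)
```

In this toy data "Jon" never meets "JOHN", so a spelling mistake inside the first 3 characters is still lost by this kind of blocking.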
Our match condition is: when ids are available in both records of a pair, at least 1 id should match; otherwise,
first name, last name and address should match.
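Written as a plain function, one reading of this rule looks roughly like the sketch below. The function is_rule_match and the demo flags are hypothetical; each input is an agreement flag per pair (TRUE, FALSE, or NA when an identifier is missing in either record).

```r
# Sketch of our match rule; all names here are hypothetical.
is_rule_match <- function(first, last, addr, id1, id2) {
  id_flags <- cbind(id1, id2)
  # An id is comparable only if it is present in both records (flag not NA).
  ids_comparable <- rowSums(!is.na(id_flags)) > 0
  any_id_agrees <- rowSums(id_flags, na.rm = TRUE) > 0
  names_and_address <- first & last & addr
  # Comparable ids decide the match; otherwise fall back to name and address.
  ifelse(ids_comparable, any_id_agrees, names_and_address)
}

# Made-up agreement flags for four pairs.
first <- c(TRUE, TRUE, TRUE, FALSE)
last  <- c(TRUE, TRUE, TRUE, FALSE)
addr  <- c(TRUE, FALSE, TRUE, FALSE)
id1   <- c(NA, FALSE, NA, TRUE)
id2   <- c(NA, FALSE, NA, NA)

result <- is_rule_match(first, last, addr, id1, id2)
print(result)
```

A deterministic filter like this could be applied to the candidate pairs alongside the weights, but whether that fits our setup is exactly what we are unsure about.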
With fsWeights, we are getting different weights for similar pairs, e.g.:
Pair 1: weight 27.33. First name, last name and address match; identifiers are null in the master data. Correct match.
AMMANARI ASSEVERO 71RATHERAVESTE130STE130 12534 NA
AMMANARIA ASSEVERO 71RATHERAVE 12534 AASSEVERO@CMH-NET.ORG 761669 523783006
Pair 2: weight 27.33. Only first name and last name match; address and identifiers do not match. Wrong match.
JOHN SOOK 1532SULTANAVE 70112 nursejd@cox.net 67541 740753012
JOHN SOOKE 201LYONSAVE 7112 SRSOOKE314@GMAIL.COM 9110520 350169181
Pair 3: weight 42. First name, last name and address match; identifiers are null in the master data.
BORGES TENCIA 2608ERESIDENTIALBLVD 33344 NA
BORGES TENCIA 2608ERESIDENTIALBLVD 33344 BORGES.TENCIA@HOLY-CROSS.COM 1519647 3008480850
Pairs 1 and 3 should get the same weight given the m, u and cutoff values above.
How can we increase the weight of pair 1 or decrease the weight of pair 2 so that we capture all correct matches?