
We are using RLBigDataLinkage (from the RecordLinkage package) in R to link two data sets: 1. Master data (~1.6 million records) 2. Target data (~100k records)

The columns are first name, last name, address, zip, unique id 1, and unique id 2.

The unique ids are not available for all records in either data set. When they are available, though, they should be given the highest importance.

We are using fsWeights to supply the m and u probabilities and the cutoffs, so that each match pattern gets a fixed weight.

compare_fs <- fsWeights(
  compare_lm,
  m = c(0.99, 0.99, 0.8, 0.6, 0.9, 0.6, 0.99, 0.99, 0.85),
  u = c(0.000001, 0.000008, 0.00003, 0.00003, 0.8, 0.009, 0.000001, 0.000001, 0.000003),
  cutoff = c(1, 0.95, 0.95, 0.9, 1, 0.98, 0.99, 0.99, 1)
)
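To make "a fixed weight for a pattern" concrete, here is a minimal base-R sketch of the standard Fellegi-Sunter weight for one binary agreement pattern. It is not the package's internal code: the helper name `fs_weight`, the use of base-2 logarithms, and the example pattern (which positions disagree) are assumptions made for illustration only.

```r
# Fellegi-Sunter weight of one agreement pattern (1 = agree, 0 = disagree).
# Hypothetical helper; base-2 logs are an assumption of this sketch.
fs_weight <- function(pattern, m, u) {
  stopifnot(length(pattern) == length(m), length(m) == length(u))
  agree    <- log2(m / u)              # contribution when a field agrees
  disagree <- log2((1 - m) / (1 - u))  # contribution when a field disagrees
  sum(ifelse(pattern == 1, agree, disagree))
}

# m and u values copied from the fsWeights call above
m <- c(0.99, 0.99, 0.8, 0.6, 0.9, 0.6, 0.99, 0.99, 0.85)
u <- c(0.000001, 0.000008, 0.00003, 0.00003, 0.8, 0.009,
       0.000001, 0.000001, 0.000003)

all_agree    <- rep(1, length(m))
# Hypothetical pattern: positions 5 and 6 disagree, everything else agrees
some_differ  <- c(1, 1, 1, 1, 0, 0, 1, 1, 1)

fs_weight(all_agree, m, u)
fs_weight(some_differ, m, u)
```

Under this formula, two pairs with the same agreement pattern always get the same weight, so differing weights imply the patterns themselves differ after the cutoffs are applied.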

We are using string comparison on all our columns and blocking on the first 3 characters of the first name (this is done to avoid missing pairs that have spelling mistakes in the first name).

compare_lm <- RLBigDataLinkage(
  master_lm, target_lm,
  blockfld = c("FIRST_NAME_3"),
  strcmp = c("FIRST_NAME", "LAST_NAME", "ADDRESS1", "ZIP_OR_POSTAL_CODE", "UNIQUEID_1", "UNIQUEID_2"),
  strcmpfun = "jarowinkler",
  exclude = c("ID")
)
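For readers following along, the blocking column can be derived with base R before the linkage call. This is only a sketch on toy data: the data frame `master_toy` is hypothetical, and upper-casing before taking the prefix is an assumption, not something stated in the post.

```r
# Hypothetical toy data; only the column names come from the post.
master_toy <- data.frame(
  FIRST_NAME = c("AMMANARI", "JOHN", "BORGES"),
  stringsAsFactors = FALSE
)

# Blocking key: first 3 characters of the (upper-cased) first name
master_toy$FIRST_NAME_3 <- substr(toupper(master_toy$FIRST_NAME), 1, 3)

master_toy
```

Note that a spelling mistake inside the first three characters would still place two records in different blocks, so such pairs are never compared.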

Our match condition is: when ids are available in both records of a pair, at least one id should match; otherwise, first name, last name, and address should match.
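The rule above can be written down as a small predicate, which is handy for checking the linkage output against the intended logic. This sketch assumes one reading of the rule (the name and address check applies only when no id is available on both sides); the function name `is_match` and all argument names are hypothetical.

```r
# Hypothetical rule check, vectorised over pairs.
# id*_a / id*_b: id values from the two records (NA when missing)
# first_ok, last_ok, addr_ok: logical flags for the name/address comparisons
is_match <- function(id1_a, id1_b, id2_a, id2_b, first_ok, last_ok, addr_ok) {
  id1_both <- !is.na(id1_a) & !is.na(id1_b)
  id2_both <- !is.na(id2_a) & !is.na(id2_b)
  ids_available <- id1_both | id2_both

  # FALSE & NA is FALSE in R, so missing ids never count as a match
  id_match <- (id1_both & id1_a == id1_b) | (id2_both & id2_a == id2_b)

  ifelse(ids_available, id_match, first_ok & last_ok & addr_ok)
}

# Ids missing on one side, name and address agree
is_match(NA, "761669", NA, "523783006", TRUE, TRUE, TRUE)
# Ids present on both sides but different, address differs
is_match("67541", "9110520", "740753012", "350169181", TRUE, TRUE, FALSE)
```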

With fsWeights, we are getting different weights for similar pairs. For example:

Pair 1: weight 27.33. First name, last name, and address match. Identifiers are null in the master data. Correct match.

AMMANARI ASSEVERO 71RATHERAVESTE130STE130 12534 NA
AMMANARIA ASSEVERO 71RATHERAVE 12534 AASSEVERO@CMH-NET.ORG 761669 523783006

Pair 2: weight 27.33. Only first name and last name match. Address and identifiers do not match. Wrong match.

JOHN SOOK 1532SULTANAVE 70112 nursejd@cox.net 67541 740753012
JOHN SOOKE 201LYONSAVE 7112 SRSOOKE314@GMAIL.COM 9110520 350169181

Pair 3: weight 42. First name, last name, and address match. Identifiers are null in the master data.

BORGES TENCIA 2608ERESIDENTIALBLVD 33344 NA
BORGES TENCIA 2608ERESIDENTIALBLVD 33344 BORGES.TENCIA@HOLY-CROSS.COM 1519647 3008480850

Pairs 1 and 3 should get the same weight given the m, u, and cutoff values above.

How can we increase the weight of pair 1 or decrease the weight of pair 2 so that we get all the correct matches?

  • The last m, u, and cutoff values are for First_3 (the first 3 characters of the first name), which is used for blocking – Swati Upadhyaya Dec 11 '16 at 11:31
  • We are getting a warning in the implementation, which leads to some records not being processed for the cutoffs. This is why we get wrong weights for those records. Warning: In patterns[, colInd] < cutoff : longer object length is not a multiple of shorter object length – Swati Upadhyaya Dec 12 '16 at 09:52
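The warning quoted in the last comment is R's general recycling warning: it appears whenever two vectors of incompatible lengths are compared. The sketch below reproduces it in plain base R and adds a length check that could be run before calling fsWeights. It does not reproduce the package's internals; `scores`, `check_lengths`, and `n_fields` are hypothetical names, and the number of comparison fields has to be supplied by the user.

```r
scores <- c(0.97, 0.99, 0.90, 1.00, 0.80)  # hypothetical similarity values
cutoff <- c(0.95, 0.90)                     # deliberately the wrong length

# Capture the recycling warning instead of letting it pass silently
msg <- tryCatch(
  { scores < cutoff; NA_character_ },
  warning = function(w) conditionMessage(w)
)
msg

# Hypothetical pre-flight check: each of m, u, cutoff must have either
# length 1 (recycled for every field) or one value per comparison field.
check_lengths <- function(n_fields, m, u, cutoff) {
  lens <- c(m = length(m), u = length(u), cutoff = length(cutoff))
  bad <- lens[lens != n_fields & lens != 1]
  if (length(bad) > 0) {
    stop("length mismatch for: ", paste(names(bad), collapse = ", "))
  }
  invisible(TRUE)
}

check_lengths(3, m = rep(0.9, 3), u = rep(0.1, 3), cutoff = 1)
```

When the lengths do not line up, R still returns a result by recycling the shorter vector, so cutoffs can end up applied to the wrong values without any error.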
