
I have 3 datasets of customers, each with 7 columns:

CustomerName
Address
Phone
StoreName
Mobile
Longitude
Latitude

Every dataset has 13,000-18,000 records. I am trying to fuzzy match between them for deduplication, but the columns don't all carry the same weight in the matching. How can I handle that? Do you know a good library for this case?

  • @fgregg Can I use dedupe for this case? – Dr Sima May 09 '18 at 09:21
  • Yes, dedupe will work here: just merge the 3 datasets into one and run it through dedupe to get clusters of probable duplicates. I've used dedupe extensively for this kind of task. – min2bro May 22 '18 at 07:21
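
A minimal sketch of that suggestion, assuming the three datasets are already loaded as pandas DataFrames df1, df2, df3 and the dedupe 2.x API; the field list and threshold below are illustrative, not from the comment:

import pandas as pd
import dedupe

# assumption: df1, df2, df3 are the three customer DataFrames
df = pd.concat([df1, df2, df3], ignore_index=True)
# dedupe expects {record_id: {field: value}} with strings or None,
# so replace NaN before converting
data = df.astype(object).where(df.notna(), None).to_dict(orient='index')

# dedupe learns a weight per field from the pairs you label,
# which addresses the "columns don't have the same weight" concern
fields = [
    {'field': 'CustomerName', 'type': 'String'},
    {'field': 'Address', 'type': 'String'},
    {'field': 'StoreName', 'type': 'String'},
    {'field': 'Phone', 'type': 'String', 'has missing': True},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)   # label some candidate pairs interactively
deduper.train()

clusters = deduper.partition(data, threshold=0.5)  # illustrative threshold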

1 Answer


I think the Recordlinkage library would suit your purposes.

You can use the Compare object to define the various kinds of comparisons you need.
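
Before the comparisons run, you need candidate pairs and a Compare instance; a minimal sketch, assuming the three datasets are concatenated into one DataFrame df (the indexing choice and window size are my assumptions, not part of the answer):

import pandas as pd
import recordlinkage

# assumption: df1, df2, df3 are the three customer DataFrames
df = pd.concat([df1, df2, df3], ignore_index=True)

# a full pairwise index over ~45k records is far too large, so generate
# candidate pairs with sorted-neighbourhood indexing (window is illustrative)
indexer = recordlinkage.Index()
indexer.sortedneighbourhood('CustomerName', window=9)
pairs = indexer.index(df)

compare_cl = recordlinkage.Compare()

With pairs and compare_cl in place, the comparisons: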

compare_cl.exact('CustomerName', 'CustomerName', label='CustomerName')
compare_cl.string('StoreName', 'StoreName', method='jarowinkler', threshold=0.85, label='StoreName')
compare_cl.string('Address', 'Address', threshold=0.85, label='Address')

Then, when deciding what counts as a match, you can customize the result, e.g. require at least 2 of the 3 features to match:

features = compare_cl.compute(pairs, df)
matches = features[features.sum(axis=1) >= 2]
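
Since the question notes that the columns don't carry equal weight, one option (my suggestion, not a built-in recordlinkage feature) is to weight the comparison vector before thresholding; the weights and cut-off below are illustrative and should be tuned on your data:

# illustrative weights; keys must match the labels used in the Compare step
weights = {'CustomerName': 0.5, 'StoreName': 0.3, 'Address': 0.2}

# weighted sum of the 0/1 comparison features per candidate pair
score = sum(features[label] * w for label, w in weights.items())
matches = features[score >= 0.7]   # assumed cut-off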