
I have 3 datasets of customers, each with 7 columns:

CustomerName
Address
Phone
StoreName
Mobile
Longitude
Latitude

Every dataset has 13,000-18,000 records. I am trying to fuzzy match between them for deduplication, but the columns don't all carry the same weight in the matching. How can I handle that? Do you know a good library for this case?

  • @fgregg Can I use dedupe for this case? – Dr Sima May 09 '18 at 09:21
  • Yes, dedupe will work here: just merge the 3 datasets into one and run it through dedupe to get clusters of probable duplicates. I've used dedupe extensively for this kind of task. – min2bro May 22 '18 at 07:21
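
A minimal sketch of that suggestion, assuming the three datasets are already loaded as pandas DataFrames df1, df2, df3 and the dedupe 2.x API; the field list and threshold below are illustrative, not from the comment:

import pandas as pd
import dedupe

# assumption: df1, df2, df3 are the three customer DataFrames
df = pd.concat([df1, df2, df3], ignore_index=True)
# dedupe expects {record_id: {field: value}} with strings or None,
# so replace NaN before converting
data = df.astype(object).where(df.notna(), None).to_dict(orient='index')

# dedupe learns a weight per field from the pairs you label,
# which addresses the "columns don't have the same weight" concern
fields = [
    {'field': 'CustomerName', 'type': 'String'},
    {'field': 'Address', 'type': 'String'},
    {'field': 'StoreName', 'type': 'String'},
    {'field': 'Phone', 'type': 'String', 'has missing': True},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)   # label some candidate pairs interactively
deduper.train()

clusters = deduper.partition(data, threshold=0.5)  # illustrative threshold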

1 Answer


I think the Recordlinkage library would suit your purposes.

You can use the Compare object to define the various kinds of comparisons you need.
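
Before the comparisons run, you need candidate pairs and a Compare instance; a minimal sketch, assuming the three datasets are concatenated into one DataFrame df (the indexing choice and window size are my assumptions, not part of the answer):

import pandas as pd
import recordlinkage

# assumption: df1, df2, df3 are the three customer DataFrames
df = pd.concat([df1, df2, df3], ignore_index=True)

# a full pairwise index over ~45k records is far too large, so generate
# candidate pairs with sorted-neighbourhood indexing (window is illustrative)
indexer = recordlinkage.Index()
indexer.sortedneighbourhood('CustomerName', window=9)
pairs = indexer.index(df)

compare_cl = recordlinkage.Compare()

With pairs and compare_cl in place, the comparisons: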

compare_cl.exact('CustomerName', 'CustomerName', label='CustomerName')
compare_cl.string('StoreName', 'StoreName', method='jarowinkler', threshold=0.85, label='StoreName')
compare_cl.string('Address', 'Address', threshold=0.85, label='Address')

Then, when deciding what counts as a match, you can customize the result, e.g. require at least 2 of the 3 features to match:

features = compare_cl.compute(pairs, df)
matches = features[features.sum(axis=1) >= 2]
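
Since the question notes that the columns don't carry equal weight, one option (my suggestion, not a built-in recordlinkage feature) is to weight the comparison vector before thresholding; the weights and cut-off below are illustrative and should be tuned on your data:

# illustrative weights; keys must match the labels used in the Compare step
weights = {'CustomerName': 0.5, 'StoreName': 0.3, 'Address': 0.2}

# weighted sum of the 0/1 comparison features per candidate pair
score = sum(features[label] * w for label, w in weights.items())
matches = features[score >= 0.7]   # assumed cut-off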