2

I'm using the Dedupe library to match person records to each other. My data includes name, date of birth, address, phone number and other personally identifying information.

Here is my question: I always want to match two records with 100% confidence if they have a matching name and phone number (for example).

Here is an example of some of my code:

fields = [
    {'field' : 'LAST_NM', 'variable name' : 'last_nm', 'type': 'String'},
    {'field' : 'FRST_NM', 'variable name' : 'frst_nm', 'type': 'String'},
    {'field' : 'FULL_NM', 'variable name' : 'full_nm', 'type': 'Name'},
    {'field' : 'BRTH_DT', 'variable name' : 'brth_dt', 'type': 'String'},
    {'field' : 'SEX_CD', 'type': 'Exact'},
    {'field' : 'FULL_US_ADDRESS', 'variable name' : 'us_address', 'type': 'Address'},
    {'field' : 'APT_NUM', 'type': 'Exact'},
    {'field' : 'CITY', 'type': 'ShortString'},
    {'field' : 'STATE', 'type': 'ShortString'},
    {'field' : 'ZIP_CD', 'type': 'ShortString'},
    {'field' : 'HOME_PHONE', 'variable name' : 'home_phone', 'type': 'Exact'},
    {'type': 'Interaction', 'interaction variables' : ['full_nm', 'home_phone']},
]

In the Dedupe library, is there any way for me to explicitly match two or more fields? According to the docs, "An interaction field multiplies the values of the multiple variables." (https://dedupe.readthedocs.org/en/latest/Variable-definition.html#interaction). I want to implement a strict rule that it matches with 100% confidence - not merely multiplying the values of the variables. The reason I ask is that I have found that occasionally Dedupe misses some matches on these two criteria (likely a result of me not training long enough, but regardless, I just want to hard code these matches into my script).

Any suggestions?

fgregg
blahblahblah

2 Answers

5

Dedupe does not have this feature and probably never will (I'm one of the main authors). If it's truly a rule that an exact match on these fields means the records are co-referent, you can write some code to explicitly match these before sending the rest of the records into Dedupe.

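from collections import defaultdict

# Group record ids by the fields that must match exactly.
# `records` is assumed to be a dict of record_id -> record dict.
exact_matches = defaultdict(list)
for record_id, record in records.items():
    match_key = (record['name'], record['phone'])
    exact_matches[match_key].append(record_id)

# Keep one record per group for Dedupe; remember where the others belong.
partially_deduplicated = []
exact_lookup = {}  # set-aside record id -> id of the record kept for its group
for match_group in exact_matches.values():
    head_id = match_group.pop()
    partially_deduplicated.append((head_id, records[head_id]))
    for dupe_id in match_group:
        exact_lookup[dupe_id] = head_id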
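
Once the remaining records have been clustered, the records that were set aside still need to be added back to the cluster of their head record. The following is a minimal sketch of that step, not part of Dedupe itself. The helper name `expand_clusters` is hypothetical, and it assumes the clustering result has already been reduced to a list of tuples of record ids (any scores returned alongside the ids would need to be stripped first).

```python
def expand_clusters(clusters, exact_lookup):
    """Add exact-match duplicates back to the clusters of their head records.

    clusters: iterable of tuples of record ids (assumed shape).
    exact_lookup: dict of set-aside record id -> head record id.
    """
    # Invert exact_lookup: head id -> list of record ids that were set aside.
    set_aside = {}
    for dupe_id, head_id in exact_lookup.items():
        set_aside.setdefault(head_id, []).append(dupe_id)

    expanded = []
    clustered_ids = set()
    for cluster in clusters:
        members = list(cluster)
        for record_id in cluster:
            clustered_ids.add(record_id)
            # Append any exact duplicates that were represented by this record.
            members.extend(set_aside.get(record_id, []))
        expanded.append(members)

    # A head record that ended up in no cluster still forms a group
    # with its own exact duplicates.
    for head_id, dupes in set_aside.items():
        if head_id not in clustered_ids:
            expanded.append([head_id] + dupes)
    return expanded


# Toy data: "b" and "c" were exact matches of "a"; "e" was an exact match of "d".
exact_lookup = {"b": "a", "c": "a", "e": "d"}
clusters = [("a", "x")]  # the later clustering pass linked "a" and "x"
result = expand_clusters(clusters, exact_lookup)
```

In this toy run, "b" and "c" join the cluster containing "a", and "d" and "e" form their own group even though the later pass never clustered "d".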
fgregg
1

Set all the fields you want to match exactly to type 'Exact', for example:

{'field' : 'FULL_NM', 'variable name' : 'full_nm', 'type': 'Exact'},
  • Thanks, @barny. To clarify, I want Dedupe to work its magic in addition to the few conditions where matches are automatically made. If I understand Dedupe, the probability of two rows matching is a combination of the probabilities across all the fields. For example, if I set HOME_PHONE to 'Exact', two records with the same phone number may not always match if the name and birth date are significantly different. I don't think that just changing FULL_NM to 'Exact' will create a rule where all records with name and phone number matches will match because other fields may be very different. – blahblahblah Sep 13 '15 at 16:23
  • You already have the interaction configured for home phone and full name, and the phone is already set to exact, so making the full name an exact match should do the trick. Give an example of what dedupe is failing to spot. Is it always missing the same combination? – DisappointedByUnaccountableMod Sep 13 '15 at 20:23
  • Have you tried an initial rule set of fields with just name and phone number, both set to exact, and probably an interaction referencing them both? If you are doing exact matches you don't really need dedupe: you could pre-process to find those records and remove them from subsequent dedupe processing. – DisappointedByUnaccountableMod Sep 13 '15 at 22:44
  • I just tested again and even after setting name and phone number to 'Exact' with an interaction, Dedupe does not match some records (likely because other fields like address, birth date and others do not match). I think your most recent comment is the best approach. I think I just need to pre-process first and then run dedupe on the rest of the records. I was just hoping I could do both at once! – blahblahblah Sep 14 '15 at 00:22
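
The behaviour described in these comments, where an exact name and phone match still fails to produce a match, can be illustrated with a toy scoring model. This is only a sketch under stated assumptions: it treats the match probability as a logistic function of a weighted sum of per-field agreement values (1.0 for agreement, 0.0 for disagreement), with the interaction term being the product of the name and phone values. The weights and bias are invented for illustration and are not taken from Dedupe or from any trained model.

```python
import math


def match_probability(features, weights, bias):
    """Logistic score over weighted field features (illustrative model only)."""
    score = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def make_features(name, phone, address, birth_date):
    """Build the feature dict; the interaction is the product of two fields."""
    return {
        "name": name,
        "phone": phone,
        "name_x_phone": name * phone,
        "address": address,
        "birth_date": birth_date,
    }


# Hypothetical weights: every field, including the interaction, is just one
# additive contribution to the score.
weights = {
    "name": 2.0,
    "phone": 2.0,
    "name_x_phone": 3.0,
    "address": 4.0,
    "birth_date": 4.0,
}
bias = -8.0

# All fields agree.
all_agree = match_probability(make_features(1.0, 1.0, 1.0, 1.0), weights, bias)

# Name and phone agree exactly, but address and birth date do not.
only_name_phone = match_probability(make_features(1.0, 1.0, 0.0, 0.0), weights, bias)

print(f"all fields agree:      {all_agree:.3f}")
print(f"only name and phone:   {only_name_phone:.3f}")
```

With these made-up weights, the second pair scores below 0.5 despite the exact name and phone match, because the interaction term adds to the score rather than overriding the other fields. A hard rule therefore has to live outside the model, as in the pre-processing approach.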