Use Python dedupe library to return all matches against messy dataset

Question

First, if you haven't seen the Dedupe library for Python: it's awesome. Much like TensorFlow, it's a great way to bring machine learning to the masses (like me).

I'm trying to do record linkage of names against a single, large, messy data set. I'm using heuristics right now, but they're starting to fall short as the data sets get more complicated.

Questions:

Is there a way to perform a match of a single record (one-by-one or in batches) and return all the potential matches?

The Gazetteer docs say one side must be clean, with no duplicates. If names can be duplicated but serial numbers can't (and serial numbers aren't used in matching), does that count as a duplicate?

Context:

There are 1.6M specialized construction machines in the US. There is a database with the machine type, owner names (up to two, companies included), serial number, and maintenance information like last_service_date.

People often inquire about maintenance and sales of their machines (100-250/day), and I keep a running record. The problem is matching the name on the phone with the machine(s) that they own. I need to match the names I have on the forms with the names on the ownership records to learn more about the machine after the fact and understand the lifecycle of the machines.

Sample Data:

"""
 This is simplified data. We often have two names on the form, and owner names
 come in first_name, last_name format but are often split in strange ways when
 multiple owners have a single machine.
"""
# Incoming Record (100-250+ per day)
{
'raw_name': 'Maria C Hernandez', 'inquire_date': '2017-11-16', 'inquire_type': 'sale'
}

# Ownership Records (1.6M+, with duplicates of NAME but not SERIAL #)
[
{'owner_1': 'HECTOR & MARIANNE HERNANDEZ', 'owner_2': '', 'serial': '3993892k'},
{'owner_1': 'MARIANA HERNANDEZ', 'owner_2': '', 'serial': '8383883hh'},
{'owner_1': 'MARIA HERNANDEZ', 'owner_2': 'TAMMY ULMER', 'serial': '123fdfe'},
{'owner_1': 'JOSE & MARIA HERNANDEZ', 'owner_2': 'MH CORP', 'serial': '223466y4'},
{'owner_1': 'MARIA C HERNANDEZ', 'owner_2': 'HIPOLITO HERNANDEZ', 'serial': '2433ff3345'},
]

Maybe I need some guidance, as well... For our heuristics, I essentially split the name fields in both data sets and compare them in 6 or 7 different ways. Now we are getting inquiries with multiple names that could help matching. Maybe more heuristics would work, but this tool seems perfect for the job.

score 1 · Answer 1 · answered Nov 17 '17 at 04:30

You can use a string metric to compare records one by one, but checking every record this way is computationally inefficient, since it amounts to a full scan. With a string metric you can combine fields into a single string and assign weights to them. For example, combining names and phone numbers also helps avoid real duplicates (two entries for the same person), because the combination will be a unique string. You can either work out the weights yourself or let dedupe calculate them using "active learning".

See the documentation below for details.

https://dedupe.io/developers/library/en/latest/Matching-records.html
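
As a rough illustration of the weighted string-metric idea above, the sketch below scores one incoming name against every ownership record using Python's standard-library `difflib` and returns all candidates above a cutoff. The field weights, the 0.6 cutoff, and the function names are assumptions made for illustration. None of this is dedupe's API, and the loop is exactly the kind of full scan described above.

```python
from difflib import SequenceMatcher

# Hypothetical weights: owner_1 is assumed to be more informative than owner_2.
WEIGHTS = {"owner_1": 0.7, "owner_2": 0.3}


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(raw_name, record):
    """Weighted average similarity over the non-empty owner fields."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        value = record.get(field)
        if value:  # skip empty or missing owner fields
            total += weight * similarity(raw_name, value)
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def all_matches(raw_name, records, threshold=0.6):
    """Return every (serial, score) at or above the threshold, best first."""
    scored = [(rec["serial"], score(raw_name, rec)) for rec in records]
    hits = [(serial, s) for serial, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Calling `all_matches('Maria C Hernandez', ownership_records)` returns a list of candidates rather than a single best guess. Averaging over both owner fields penalizes records with an unrelated co-owner; taking the maximum over fields instead would be an equally reasonable choice.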

answered Nov 17 '17 at 04:30

codeslord


Thanks for your thoughts and for taking the time to reply! Combining strings might be one answer, but truthfully the power of this library seems to come from its ability to compare across variables and do logistic regression. Do you have any ideas on how to use that? – PANDA Stack Nov 18 '17 at 20:24
@PANDAStack I cannot claim much expertise in dedupe. Could you please check the following documentation? Going through just the first few pages will help you a lot. https://media.readthedocs.org/pdf/dedupe/latest/dedupe.pdf – codeslord Nov 19 '17 at 11:17
One thing I'm trying to find out is whether the fields of a record in a Gazetteer must be unique when taken in combination. – PANDA Stack Nov 21 '17 at 04:39

score 1 · Accepted Answer · answered Nov 26 '17 at 04:18

This is a good use case for the Gazetteer class. I'm not sure why you think it isn't appropriate.

(I am the primary author of dedupe)
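
A minimal sketch of how that might look with the data in the question, offered as an illustration rather than a tested recipe. It keys the ownership records by serial number, so records that share an owner name stay distinct entries, and then hands them to a Gazetteer. The method names (`prepare_training`, `console_label`, `train`, `index`, `search`) are assumed to match the dedupe 2.x Gazetteer workflow and should be checked against the installed version. The helper names, the use of the serial as the record id, and `n_matches=5` are assumptions made for illustration.

```python
def to_dedupe_records(rows, id_field):
    """Shape a list of row dicts as {record_id: {field: value}}.

    Empty strings become None, which is how dedupe expects missing
    values to be represented.
    """
    records = {}
    for row in rows:
        record_id = row[id_field]
        records[record_id] = {
            key: (value.strip() or None) if isinstance(value, str) else value
            for key, value in row.items()
            if key != id_field
        }
    return records


def search_gazetteer(fields, incoming, ownership, n_matches=5):
    """Train a Gazetteer interactively and return scored candidates.

    `fields` is the variable definition for the installed dedupe version;
    its syntax differs between releases, so it is left to the caller.
    """
    # Third-party package, imported lazily so the helper above works without it.
    import dedupe

    gazetteer = dedupe.Gazetteer(fields)
    gazetteer.prepare_training(incoming, ownership)
    dedupe.console_label(gazetteer)  # interactive active-learning session
    gazetteer.train()
    gazetteer.index(ownership)  # the large ownership table is the indexed side
    # Up to n_matches scored candidates per incoming record.
    return gazetteer.search(incoming, n_matches=n_matches, generator=False)
```

Both data sets need to use the same field names as `fields`, so the incoming `raw_name` would have to be mapped to a name shared with the ownership records (for example `owner_1`) before calling `search_gazetteer`.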

answered Nov 26 '17 at 04:18

fgregg

I think it was confusion on my part: I read that the Gazetteer data needed to be "pre-deduped" or canonical, and in my list there might be 12 machines that have the same owner string in the field `owner_1`. I ended up using Gazetteer, though. In my case I decided to finally use fuzzywuzzy and affinegap to block probable matches and then apply some heuristics instead. – PANDA Stack Jan 08 '18 at 14:37
I am using StaticDedupe for deduping. However, how do I know which field caused a match for a certain incoming record? – Eswar Jun 02 '21 at 06:51
