
I'm using the Dedupe library to match person records to each other. My data includes first_name, last_name, email, phone1, phone2, phone3, and address information.

Here is my question: I always want two records to match with 80% to 99% confidence when they have the same first_name and last_name together with matching contact details (phone1, phone2, phone3, email, and address). I also want to match phone numbers across fields, for example phone1 = phone2, phone1 = phone3, or phone2 = phone3.

Here is an example of some of my code:

fields = [
{'field' : 'first_name','variable name': 'ffname','type': 'Exact'},
{'field' : 'last_name','variable name': 'lname','type': 'Exact'},
{'field' : 'email','variable name': 'email', 'type': 'Exact','Has Missing':True},
{'field' : 'phone1','variable name': 'phone1', 'type': 'Exact', 'Has Missing':True},
{'field' : 'phone2','variable name': 'phone2', 'type': 'Exact', 'Has Missing':True},
{'field' : 'phone3','variable name': 'phone3', 'type': 'Exact', 'Has Missing':True},
{'field' : 'address','variable name': 'addr','type': 'String','Has Missing':True}    
]

In the Dedupe library, is there any way for me to match cross phone number with first_name and last_name?

shael

1 Answer


Looking at the documentation, there are two ways of doing that.

The first is to use the Set variable type. The catch: Set is similar to Text in the way it compares values. It looks at common terms, so from that perspective the phone number (123) 456-7890 is not the same as 4567890.

The other alternative, which I believe is better, is to build a custom comparator. This comparator takes two lists of phone numbers and returns a number; the lower the number, the closer the match. It can be based on the affine gap distance, which is already used for String variables. Here's an implementation:

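from affinegap import normalizedAffineGapDistance as affineGap

def phonesComparator(f1, f2):
    # f1 and f2 are the lists of phone numbers of the two records
    distances = []

    # Compare every phone of the first record with every phone of the second
    for p1 in f1:
        for p2 in f2:
            distances.append(affineGap(p1, p2))
    if distances:
        # The closest pair decides the score
        return min(distances)
    else:
        # At least one list is empty: return a large fallback distance
        return 200.0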

Here I'm returning the minimum distance between any pair of phone numbers taken from the two lists, but one can of course come up with alternative measures.

One final note: when creating the records, one should place all the phones in a single field. That list should be a list of phone numbers (or the empty list if there are none).
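
As a rough sketch of that record preparation, the snippet below merges phone1, phone2 and phone3 into a single field and pairs it with an exact-overlap comparator, which is one possible alternative measure when only identical numbers should count. The helper names (normalize_phone, collect_phones, prepare, exact_phones_comparator) and the merged field name 'phones' are made up for illustration; the field entry assumes the dict-style Custom variable definition with a 'comparator' key, in the same style as the fields list in the question.

```python
import re

PHONE_KEYS = ("phone1", "phone2", "phone3")  # column names from the question


def normalize_phone(raw):
    """Keep digits only, so '(123) 456-7890' and '123-456-7890' compare equal."""
    return re.sub(r"\D", "", str(raw)) if raw else ""


def collect_phones(record):
    """Merge phone1..phone3 into one tuple of distinct, non-empty numbers."""
    phones = []
    for key in PHONE_KEYS:
        digits = normalize_phone(record.get(key))
        if digits and digits not in phones:
            phones.append(digits)
    return tuple(phones)


def prepare(record):
    """Return a copy of the record with a single 'phones' field (hypothetical name)."""
    out = {k: v for k, v in record.items() if k not in PHONE_KEYS}
    out["phones"] = collect_phones(record)
    return out


def exact_phones_comparator(f1, f2):
    """Distance-style score: 0.0 if any phone number is shared, else 1.0."""
    return 0.0 if set(f1) & set(f2) else 1.0


# Assumed field definition for the merged field, replacing the three phone entries
phones_field = {
    "field": "phones",
    "variable name": "phones",
    "type": "Custom",
    "comparator": exact_phones_comparator,
}

a = prepare({"first_name": "Ann", "phone1": "(123) 456-7890", "phone2": None, "phone3": ""})
b = prepare({"first_name": "Ann", "phone1": "", "phone2": "555 0000", "phone3": "123-456-7890"})
score = exact_phones_comparator(a["phones"], b["phones"])  # phone1 of a equals phone3 of b
```

Because every number of one record is checked against every number of the other, a phone1 = phone3 match is caught without any extra fields. The name fields stay as separate Exact variables, and the trained model decides how much weight the phone score gets alongside them.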

Roy2012
  • Do you find this solution useful? – Roy2012 Jun 13 '20 at 03:20
  • Yes, but I also need a name match. Does it work with names? – shael Jun 15 '20 at 05:31
  • Sure. This is just one field. You can have multiple fields: one for first name, one for last name, one for street, and one for *multiple* phone numbers. – Roy2012 Jun 15 '20 at 05:33
  • I need an exact phone number match, including across phone fields, together with a name match. How can I train the model for that? – shael Jun 16 '20 at 09:20
  • Not sure I understand. When you train the model, the system identifies potentially similar records (based on name, phone, etc.) and asks whether they are the same or not. Could you please elaborate on what you mean? (We can do that over chat as well.) – Roy2012 Jun 16 '20 at 09:43