2

I have a bunch of customer data that is normalized into multiple tables. I want to decide on the best criteria for guessing that two customer records might belong to the same person. There needs to be a balance between minimizing the number of duplicates and minimizing false positives, since every false positive interrupts a user to ask about a potential dupe.

I am looking at some combination of first/last name + phone number || email address.

The first question is: what is a good set of criteria for determining if a customer might be the same as another customer?

The second question is, for this specific application, I only want to detect duplicates for customers that have signed up within the last 2 months or so. Does this change the detection criteria at all?
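To make the rule under discussion concrete, here is a minimal sketch of "name AND (phone OR email)" restricted to recent signups. The `Customer` record, the normalization helpers, and the 60-day window are all assumptions for illustration; they are not taken from the actual schema, and exact matching on normalized values is only the simplest possible starting point.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Customer:
    # Hypothetical record shape; the real data is spread over several tables.
    first_name: str
    last_name: str
    phone: str
    email: str
    signed_up: datetime


def norm_name(value: str) -> str:
    # Lowercase and keep ASCII letters only (deliberately naive).
    return re.sub(r"[^a-z]", "", value.lower())


def norm_phone(value: str) -> str:
    # Keep digits only so "555-0100" and "(555) 0100" compare equal.
    return re.sub(r"[^0-9]", "", value)


def norm_email(value: str) -> str:
    return value.strip().lower()


def might_be_same(a: Customer, b: Customer) -> bool:
    """Name must match, plus at least one non-empty contact field."""
    same_name = (
        norm_name(a.first_name) == norm_name(b.first_name)
        and norm_name(a.last_name) == norm_name(b.last_name)
    )
    phone_a, phone_b = norm_phone(a.phone), norm_phone(b.phone)
    email_a, email_b = norm_email(a.email), norm_email(b.email)
    same_phone = bool(phone_a) and phone_a == phone_b
    same_email = bool(email_a) and email_a == email_b
    return same_name and (same_phone or same_email)


def recent_candidates(new, existing, now, window_days=60):
    """Return existing customers inside the signup window that match `new`."""
    cutoff = now - timedelta(days=window_days)
    return [c for c in existing if c.signed_up >= cutoff and might_be_same(new, c)]
```

In this sketch the time window only narrows which records are compared; it does not change the matching rule itself.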

Christopher Martin

3 Answers

1

How would you go about asking a customer if they are the owner of a duplicate account?

"Hey Sam Jones, there is another Sam Jones who has an IP in your local area. His email is sam.jones@abc.com and your latest registration had an email of sam.jones@apple.com. Are you the same guy/girl?"

If the above is even close to your scenario, then you would be leaking private information, i.e. the other Sam Jones's email address.

Typically you don't allow a customer to sign up with an email address that is already registered, and you verify that the email address they do sign up with is valid. That way, if they sign up again with a mistyped email address, they can't validate it.

  • Also, I'm sure that getting around this duplication check will be easy. Just create a new e-mail address and you're done. Maybe use the landline number for the first account and the GSM number for the second. Usually you just check whether the username (if the customer needs one) and the e-mail already exist. Maybe the postal address too, and then show a message that there is already a customer with that name at the same address. But checking phone numbers? I know a company that shares its number; how would you handle that? – Bernhard Apr 13 '12 at 00:12
  • In this case, there is an employee who enters customer information. They take the information from a customer and would then be presented with the potential duplicates. Generally, there would be more information in common that would suggest the customer is the same as another. For example, the customer may have already given an address, a name that was spelled differently, a phone number, etc. It seems that there are cases when you would want to dedupe a customer's record. – Christopher Martin Apr 14 '12 at 06:45
  • But how do you do it without confirmation from the customer? I can see why Google does it in Google Contacts, but there I entered that information, or I sourced it, so I know if it is a duplicate or not. – Matthew Parlane Apr 14 '12 at 08:48
0

An important thing is to choose attributes that are unlikely to change. If you use something like telephone number or email address, you risk having duplicates any time someone changes ISPs or mobile phone providers.

If these customers have made purchases in the past, you can store a hash of their credit card number and a hash of their billing address. Whenever they make another purchase, hash their payment info and compare it to your database. (Notice I said to store a hash, NOT their actual payment info.)
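As a rough illustration of comparing payment info by hash, the sketch below uses a keyed hash (HMAC-SHA256) rather than a bare hash, because card numbers come from a small input space that is practical to brute-force. The function names and the idea of a single application-wide secret key are assumptions made for this example, and whether such a scheme meets any compliance requirement is not something this sketch settles.

```python
import hashlib
import hmac
import re


def card_fingerprint(card_number: str, secret_key: bytes) -> str:
    """Return a hex HMAC-SHA256 of the card's digits (hypothetical helper).

    The same key must be used for every record, otherwise fingerprints
    of the same card would not compare equal.
    """
    digits = re.sub(r"[^0-9]", "", card_number)  # drop spaces and dashes
    return hmac.new(secret_key, digits.encode("ascii"), hashlib.sha256).hexdigest()


def same_card(stored_fingerprint: str, card_number: str, secret_key: bytes) -> bool:
    """Compare a stored fingerprint with a freshly entered card number."""
    candidate = card_fingerprint(card_number, secret_key)
    return hmac.compare_digest(stored_fingerprint, candidate)
```

Only the fingerprint would be stored next to the customer record; the key would need to live somewhere separate from the database for the keyed hash to add any protection.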

bta
  • You might find that if you store the hash of the credit card number, it's actually fairly easy to get back to the original value, which might be in breach of PCI rules. – Colin Breame Jun 21 '16 at 15:18
  • @ColinBreame that depends on the hash algorithm that you use. Cryptographically-strong hashes like SHA-256 are considered to be one-way and don't typically break data security rules, especially if salted. Hashes with known vulnerabilities (like MD5) might be a different story. – bta Jul 03 '16 at 16:44
0

If this question is still of interest to you, please check out this tool: https://sourceforge.net/projects/deduper/

I wrote this tool mainly for the purpose that you have mentioned in this question.

vumaasha