I have a bunch of customer data that is normalized into multiple tables. I want to decide on the best criteria for guessing that two customer records might be the same person. There needs to be a balance between minimizing the number of duplicates and minimizing false positives, since every false positive interrupts a user to ask about a potential dupe.
I am looking at some combination of first/last name + phone number || email address.
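To make that rule concrete, here is a rough sketch of what I have in mind. It assumes the grouping is name AND (phone OR email), and the `Customer` fields, the normalization helpers, and the function name are all made up for illustration; none of it reflects my real schema.

```python
import re
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical flattened view of a customer; the real data is spread over several tables.
    first_name: str
    last_name: str
    phone: str = ""
    email: str = ""


def norm_name(value: str) -> str:
    # Lowercase and collapse whitespace so "LEE " and "lee" compare equal.
    return " ".join(value.lower().split())


def norm_phone(value: str) -> str:
    # Keep digits only, so "(555) 010-1234" and "555-010-1234" compare equal.
    return re.sub(r"\D", "", value)


def norm_email(value: str) -> str:
    return value.strip().lower()


def might_be_same(a: Customer, b: Customer) -> bool:
    # Assumed rule: same first/last name AND (same phone OR same email).
    same_name = (
        norm_name(a.first_name) == norm_name(b.first_name)
        and norm_name(a.last_name) == norm_name(b.last_name)
    )
    if not same_name:
        return False
    phone_a, phone_b = norm_phone(a.phone), norm_phone(b.phone)
    email_a, email_b = norm_email(a.email), norm_email(b.email)
    # An empty value never counts as a match, otherwise two blank phones would "agree".
    same_phone = bool(phone_a) and phone_a == phone_b
    same_email = bool(email_a) and email_a == email_b
    return same_phone or same_email
```

Exact matching after normalization is the strictest version of this; fuzzy name matching would catch more duplicates but would presumably raise the false positive rate.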
The first question is: what is a good set of criteria for determining whether one customer might be the same as another?
The second question is, for this specific application, I only want to detect duplicates for customers that have signed up within the last 2 months or so. Does this change the detection criteria at all?
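For reference, the time restriction itself would just be a pre-filter on the candidate set, roughly like the sketch below. The `signed_up` key, the dict layout, and the 60-day default are placeholders I picked for illustration, not my actual column names or a fixed cutoff.

```python
from datetime import datetime, timedelta


def recent_signups(customers, now, window_days=60):
    # Keep only customers whose signup date falls inside the window ending at `now`.
    # `customers` is assumed to be a list of dicts with a datetime under "signed_up".
    cutoff = now - timedelta(days=window_days)
    return [c for c in customers if c["signed_up"] >= cutoff]
```

Only the customers that survive this filter would be compared against each other, which also keeps the number of pairwise comparisons small.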