I am trying to find an efficient way of grouping duplicate contacts together based on any 3 criteria: Email, Phone number & name, account number & name and creating a master_contact_id. Here is an exaple to illustrate what I am trying to acheive:
Data looks like this:
ContactID Name Email Phone&Name Account&Name
12345 Bob Smith Bob@ABC.com 234-243-2432Bob Smith A1234Bob Smith
42023 Bob Smith Bob01@ABC.com 234-243-2432Bob Smith B1234Bob Smith
50203 Bob S. Bob@ABC.com 234-243-2432Bob S. Z1234Bob S.
20394 Clara Sakshi Clara@Sakshi.com 123-123-1234Clara Sakshi Q1231Clara Sakshi
29930 Clara Sakshi Clara@ABC.com 234-243-2432Clara Sakshi A1234Clara Sakshi
92303 Clara Sakshi Clara01@Sakshi.com 999-999-1234Clara Sakshi Q1231Clara Sakshi
Desired output:
Master ContactID ContactID Notes (not part of output):
1 12345 related to 50203 by email match
1 42023 related to 12345 by name and number match
1 50203 related to 12345 by email match
2 20394 related to 92303 by account number and name match
3 29930 Not related to any other Contacts
2 92303 related to 20394 by account number and name match
I have acheived the desired output by unpivoting the contacts table in SQL and then applying the graph walking technique described here: Grouping 'groups' with common element
Unfortunately the runtime of this solution is not viable. It took almost one hour to run a sample of 1000 records. Runtime increases exponentially as the data set increases and I have over 250000 contacts.
Any insights as to how to acheve this more efficiently (either in SQL or Python) would be greatly appreciated.
Please take note that I am essentially a beginner in SQL and I just started dabbleing in python in the hopes of finding an alternate solution.
Thank you
Luc