1

I am trying to find an efficient way of grouping duplicate contacts together based on any 3 criteria: Email, Phone number & name, account number & name and creating a master_contact_id. Here is an exaple to illustrate what I am trying to acheive:

Data looks like this:

ContactID   Name           Email               Phone&Name               Account&Name
12345       Bob Smith      Bob@ABC.com         234-243-2432Bob Smith    A1234Bob Smith
42023       Bob Smith      Bob01@ABC.com       234-243-2432Bob Smith    B1234Bob Smith
50203       Bob S.         Bob@ABC.com         234-243-2432Bob S.       Z1234Bob S.
20394       Clara Sakshi   Clara@Sakshi.com    123-123-1234Clara Sakshi Q1231Clara Sakshi
29930       Clara Sakshi   Clara@ABC.com       234-243-2432Clara Sakshi A1234Clara Sakshi
92303       Clara Sakshi   Clara01@Sakshi.com  999-999-1234Clara Sakshi Q1231Clara Sakshi

Desired output:

Master ContactID   ContactID    Notes (not part of output):
1                  12345        related to 50203 by email match
1                  42023        related to 12345 by name and number match
1                  50203        related to 12345 by email match
2                  20394        related to 92303 by account number and name match
3                  29930        Not related to any other Contacts
2                  92303        related to 20394 by account number and name match

I have acheived the desired output by unpivoting the contacts table in SQL and then applying the graph walking technique described here: Grouping 'groups' with common element

Unfortunately the runtime of this solution is not viable. It took almost one hour to run a sample of 1000 records. Runtime increases exponentially as the data set increases and I have over 250000 contacts.

Any insights as to how to acheve this more efficiently (either in SQL or Python) would be greatly appreciated.

Please take note that I am essentially a beginner in SQL and I just started dabbleing in python in the hopes of finding an alternate solution.

Thank you

Luc

Montrealer
  • 11
  • 2

1 Answers1

0

You wish to dedup records, efficiently.

Let's start with Email. Create an email_report table with PK of Email and UNIQUE KEY of ContactID. You can produce it using a simple GROUP BY. Take care to also ORDER BY ContactID and then use the MIN(ContactID) so we prefer the first ContactID added to the dataset, rather than subsequent dups. Any "unique constraint violated" messages that crop up here correspond to data cleaning issues, and it's probably best to just discard that handful of records.

Now each Email in the dataset maps to exactly one canonical ContactID, 1:1.

Repeat for Phone&Name. And for Account&Name. Giving us a base table plus three report tables. (We haven't set up FK relations but we certainly could.)

You showed example output with, roughly: SELECT mail, phone, acct. We are now in a position to produce those same three columns, via JOINs, but expressed using the common term of ContactID. And then it's a business rule, which you haven't touched on, for what to do with the ambiguous case of the three ContactIDs not matching. Maybe we do 2 out of 3 majority wins on relevant rows? Maybe every distinct tuple should spin up a new distinct MasterContactID? Perhaps we simply compute MIN( ... ) across the 3 IDs? Up to you. Given the indexing, computing such results across quarter million rows will definitely go very fast. Read those rows into a python script and apply the business logic you prefer.


Normalizing all text values beforehand probably wouldn't hurt. Lowercasing is straightforward. Maybe delete or normalize umlauts to one of three corresponding vowels. Turn punctuation like "-" and "." into SPACE, and then compress repeated spaces down to a single one. There's a whole range of Phonetic algorithms that python could apply, including Metaphone, and postgres can even apply Soundex within a SELECT.


The sql should resemble something like this. (Or use a CTE, though debugging would be less convenient.)

CREATE TABLE email_report (
    email      TEXT     PRIMARY KEY,
    contactId  INTEGER  NOT NULL,
    UNIQUE (contact_id)
);
INSERT INTO email_report
SELECT    email, MIN(contactId) AS contactId
FROM      base_table
GROUP BY  email
ORDER BY  email, contactId
;

Also phone, and account. Then report on them:

SELECT    e.contactId AS email_cid, ...     -- also phone, and account.
FROM      base_table b
JOIN      email_report e  ON b.email = e.email
JOIN                                        --- also phone
JOIN                                        --- and account
J_H
  • 17,926
  • 4
  • 24
  • 44
  • Hi J_H, thank you for your answer. Perhaps I was not sufficiently clear in my post. The first table with the emails, phone and accounts is the raw data I am trying to transform. The second table, that shows "Master ContactID" and "ContactID" is the desired output. – Montrealer Mar 14 '23 at 12:35