Scenario

For most of its history, my company used subdomains in its email addresses. Most were assigned by state, but some divisions had their own subdomains. A few examples of what we had:

mo.widgits.com
sd.widgits.com
va.widgits.com
nhq.widgits.com
gis.widgits.com
tech.widgits.com

...and so on.

New Paradigm

A few years ago, top management decided that they wanted us all to be one happy family. As part of this cultural realignment, they moved everyone's email address to a single domain, in the format firstname.lastname@widgits.com.

Present Challenges

In many of our corporate databases, we find a mixture of records using either the old format or the new format. For example, the same individual might have porky.pig@widgits.com in the employee system and porky.pig@in.widgits.com in the training system. I need to match individuals across the various systems regardless of which email format a given system uses for them.

Desired Matches

porky.pig@in.widgits.com = porky.pig@widgits.com -> true
mary.poppins@widgits.com = mary.poppins@nhq.widgits.com -> true
bob.baker@widgits.com = bob.barker@gis.widgits.com -> false

How to Accomplish This?

Is there a regex pattern that I can use to match email addresses regardless of which format they are in? Or will I need to manually strip out the subdomain before attempting to match them?

Michael Sheaver

2 Answers

Off the top of my head, you could strip off the subdomain from all email addresses before comparing them (that is, compare only the email name and domain). Something like this:

SELECT *
FROM emails
WHERE REGEXP_REPLACE(email1, '^(.*@).*?([^.]+\.[^.]+)$', '\1\2') =
      REGEXP_REPLACE(email2, '^(.*@).*?([^.]+\.[^.]+)$', '\1\2');

Demo

Data:

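-- Sample rows taken from the question
WITH emails AS (
    SELECT 'porky.pig@in.widgits.com' AS email1, 'porky.pig@widgits.com' AS email2 UNION ALL
    SELECT 'mary.poppins@widgits.com', 'mary.poppins@nhq.widgits.com' UNION ALL
    SELECT 'bob.baker@widgits.com', 'bob.barker@gis.widgits.com'
)
-- Same query as above, run against the sample rows
SELECT *
FROM emails
WHERE REGEXP_REPLACE(email1, '^(.*@).*?([^.]+\.[^.]+)$', '\1\2') =
      REGEXP_REPLACE(email2, '^(.*@).*?([^.]+\.[^.]+)$', '\1\2');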

Here is an explanation of the regex pattern used:

^                   start of the email
    (.*@)           capture the email name, including the @, as \1
    .*?             lazily consume any subdomain up to, but not including,
    ([^.]+\.[^.]+)  the final two domain labels (e.g. google.com), captured as \2
$                   end of the email

Then, we replace with \1\2 to effectively remove any subdomain components.
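To check the pattern outside the database, here is a minimal Python sketch that applies the same regex and replacement to the three pairs from the question. The helper names strip_subdomain and same_person are made up for illustration, and Python's re module stands in for the SQL REGEXP_REPLACE call, so treat it as a way to experiment with the pattern rather than as part of the query.

```python
import re

# Same pattern as in the SQL above: group 1 is the email name plus "@",
# group 2 is the last two dot-separated labels of the domain.
SUBDOMAIN_PATTERN = re.compile(r'^(.*@).*?([^.]+\.[^.]+)$')


def strip_subdomain(email: str) -> str:
    """Hypothetical helper: drop any subdomain, keeping name@domain.tld."""
    return SUBDOMAIN_PATTERN.sub(r'\1\2', email)


def same_person(email1: str, email2: str) -> bool:
    """Hypothetical helper: compare two addresses after stripping subdomains."""
    return strip_subdomain(email1) == strip_subdomain(email2)


pairs = [
    ('porky.pig@in.widgits.com', 'porky.pig@widgits.com'),
    ('mary.poppins@widgits.com', 'mary.poppins@nhq.widgits.com'),
    ('bob.baker@widgits.com', 'bob.barker@gis.widgits.com'),
]

for email1, email2 in pairs:
    print(email1, '=', email2, '->', same_person(email1, email2))
```

One caveat: the pattern keeps only the last two labels of the domain, so an address under something like widgits.co.uk would be reduced to co.uk. That is not a problem for the widgits.com addresses in the question.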

Tim Biegeleisen
  • My goodness, you have given me some homework to do tomorrow! This is what I needed, but now I must learn how it works! I have been using regex in 'nix systems for decades, but this shows me that I have only scratched the surface! – Michael Sheaver Apr 07 '21 at 04:41
  • I was gonna try it first on my real data, but your demo does show it works. Many thanks for this elegant solution to my problem! – Michael Sheaver Apr 07 '21 at 04:46

How about something like this?

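SELECT
  *
FROM
  (
    SELECT
      table1.email AS table1_email,  -- aliased so the two email columns stay distinct
      table2.email AS table2_email,
      SPLIT_PART(table1.email, '@', 1) AS table1_username,
      SPLIT_PART(table2.email, '@', 1) AS table2_username,
      SPLIT_PART(table1.email, '@', 2) AS table1_domain,
      SPLIT_PART(table2.email, '@', 2) AS table2_domain
    FROM
      table1
      CROSS JOIN table2
  ) S
WHERE
  table1_username = table2_username
  AND (
    table1_domain = table2_domain                -- both systems use the same format
    OR table1_domain LIKE '%.' || table2_domain  -- table1 holds the subdomain form
    OR table2_domain LIKE '%.' || table1_domain  -- table2 holds the subdomain form
  );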
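
The comparison this query performs can be tried out on its own with a small Python sketch. It assumes each address contains exactly one @, and the function names split_email and emails_match are hypothetical, chosen only for this illustration. It also treats identical domains as a match, which is an assumption about the intended behavior.

```python
def split_email(email: str) -> tuple[str, str]:
    """Hypothetical helper: split an address into (username, domain) at the first '@'."""
    username, _, domain = email.partition('@')
    return username, domain


def emails_match(email1: str, email2: str) -> bool:
    """Usernames must be equal, and one domain must equal the other
    or be a subdomain of it (i.e. end with '.' + the other domain)."""
    user1, domain1 = split_email(email1)
    user2, domain2 = split_email(email2)
    if user1 != user2:
        return False
    return (
        domain1 == domain2
        or domain1.endswith('.' + domain2)
        or domain2.endswith('.' + domain1)
    )


print(emails_match('porky.pig@in.widgits.com', 'porky.pig@widgits.com'))
print(emails_match('bob.baker@widgits.com', 'bob.barker@gis.widgits.com'))
```

Unlike stripping the subdomain from both sides, this comparison does not match two different subdomains against each other: porky.pig@in.widgits.com and porky.pig@nhq.widgits.com would not be paired.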
Crash0v3rrid3