Regex to extract domain from website column

Question

I have a column called 'website' from which I require to extract the domain part only

Examples

www.google.com -> google.com
www.google.com/field -> google.com
https://www.nearbyplaces.com -> nearbyplaces.com
http://hcc.ca -> hcc.ca
http://hcc.ca/info -> hcc.ca
http://hcc.ca/ -> hcc.ca

What I have done so far:


select distinct website, 
    CASE WHEN website like '%//www.%' THEN REPLACE(REGEXP_SUBSTR(website,'//[^/\\\,=@\\+]+\\.[^/:;,\\\\\(\\)]+'),'//www.','') 
         WHEN website like '%//%' THEN REPLACE(REGEXP_SUBSTR(website,'//[^/\\\,=@\\+]+\\.[^/:;,\\\\\(\\)]+'),'//','') 
         WHEN website is null then null
         WHEN website like '%www.%' THEN REPLACE(REGEXP_SUBSTR(website,'.([^/]*)'),'www.','')
         else website
         end as domain

Two Things: I am certain that I am missing out on some of the test cases. Secondly, I want to optimise the verbose in this solution. Any help in improving my snippet is appreciated. (FYI I am using Redshift )

What about subdomains not called `www`? What about top level domains that have a second level hierarchy such as `google.co.uk`? You shouldn't be parsing domains with RegExp. I'm not familiar with Redshift, but this looks like SQL and I doubt you can reliably do what you want with Regexp and SQL. You need URL parser library. — RoToRa, Apr 22 '22 at 09:50
@RoToRa I went back to see if my current solution worked on second-level hierarchy and here is the current output: ```https://www.dakotabar.co.nz/ returns dakotabar.co.nz https://www.law.cuny.edu/ returns law.cuny.edu``` — ab_padfoot, Apr 22 '22 at 13:51

Regex to extract domain from website column

0 Answers0