0

I have a column called 'website' from which I require to extract the domain part only

Examples

  1. www.google.com -> google.com
  2. www.google.com/field -> google.com
  3. https://www.nearbyplaces.com -> nearbyplaces.com
  4. http://hcc.ca -> hcc.ca
  5. http://hcc.ca/info -> hcc.ca
  6. http://hcc.ca/ -> hcc.ca

What I have done so far:


select distinct website, 
    CASE WHEN website like '%//www.%' THEN REPLACE(REGEXP_SUBSTR(website,'//[^/\\\,=@\\+]+\\.[^/:;,\\\\\(\\)]+'),'//www.','') 
         WHEN website like '%//%' THEN REPLACE(REGEXP_SUBSTR(website,'//[^/\\\,=@\\+]+\\.[^/:;,\\\\\(\\)]+'),'//','') 
         WHEN website is null then null
         WHEN website like '%www.%' THEN REPLACE(REGEXP_SUBSTR(website,'.([^/]*)'),'www.','')
         else website
         end as domain


Two Things: I am certain that I am missing out on some of the test cases. Secondly, I want to optimise the verbose in this solution. Any help in improving my snippet is appreciated. (FYI I am using Redshift )

ab_padfoot
  • 63
  • 1
  • 10
  • 1
    What about subdomains not called `www`? What about top level domains that have a second level hierarchy such as `google.co.uk`? You shouldn't be parsing domains with RegExp. I'm not familiar with Redshift, but this looks like SQL and I doubt you can reliably do what you want with Regexp and SQL. You need URL parser library. – RoToRa Apr 22 '22 at 09:50
  • @RoToRa I went back to see if my current solution worked on second-level hierarchy and here is the current output: ```https://www.dakotabar.co.nz/ returns dakotabar.co.nz https://www.law.cuny.edu/ returns law.cuny.edu``` – ab_padfoot Apr 22 '22 at 13:51

0 Answers0