3

I want to take a relatively modest set of URLs and resolve them down to the top level domains using one (or more, if required) PostgreSQL queries.

It seems like the major steps to do this are as follows:

  1. find the number and position of all the "/" in the URL
  2. If the number of '/' is > 3, and then delete everything to the right of the third '/' (including the last '/')
  3. Count the number of '.' that appear in the URL that remains from steps 1-2.
  4. If the # of '.' is 1, then simply remove whatever came before the '://'.
  5. If the # of '.' > 1, then find the largest '.' and then extract the text between the first "." and the new max length of string.

I have been able to find a few examples: (a) http://www.postgresql.org/message-id/247444.36947.qm@web50311.mail.re2.yahoo.com (b) http://www.seanbehan.com/extract-domain-names-from-links-in-text-with-postgres-and-a-single-sql-query

But none of these seems to work correctly -- I am querying a redshift database, and I get a 'Function not Implemented' error when I attempt to execute.

While there are ways to do this in Python or other languages, I haven't been able to find a SO solution specifically for PostgreSQL.

user1601238
  • 41
  • 1
  • 4

2 Answers2

1

Assuming that your urls have a scheme, did you try something like:

select substring( 'http://www.arandomsite.com' from '^[^:]*://(?:[^/:]*:[^/@]*@)?(?:[^/:.]*\.)+([^:/]+)' ) as tld;

details:

^        # anchor for the start of the string
[^:]*:// # the scheme
(?:[^/:]*:[^/@]*@)? # optional "user:password@"
(?:[^/:.]*\.)+ # other parts of the hostname
([^:/]+) # tld (note that the ":" is excluded too, to avoid to match the port)

Note: obviously this doesn't work if the url has an ipv4 or ipv6 as hostname.

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • 1
    Upon further investigation, it looks like the Redshift implementation of Postgres does not include reg_ex matching. Instead, there is a specific function for this purpose, called 'regex_substr'
     select website, regexp_substr(website, '^[^:]*://+([^:/])+' ) as tld 
    – user1601238 Jul 05 '15 at 23:13
1

not any but ok regards robust & fast:

select translate(split_part('https://developer.twitter.com/en/portal/projects/123/apps', '/', 3), '.', ' ');
>  developer twitter com

good for dumping into ts_vector

droid192
  • 2,011
  • 26
  • 43