I want to take a relatively modest set of URLs and resolve each one down to its top-level domain (e.g., http://www.example.com/some/page would become example.com) using one (or more, if required) PostgreSQL queries.
It seems like the major steps to do this are as follows:
- Find the number and position of every '/' in the URL.
- If the number of '/' is greater than 3, delete everything to the right of the third '/' (including the third '/' itself).
- Count the number of '.' characters in the URL that remains after steps 1-2.
- If the number of '.' is 1, simply remove whatever comes before the '://'.
- If the number of '.' is greater than 1, find the first '.' and extract the text between it and the end of the (now truncated) string (see the sketch after this list).
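To make the steps concrete, here is a rough sketch of the kind of query I have in mind (the table name `urls` and column `url` are placeholders, and I have not verified that every function used here is available on Redshift):

```sql
-- A sketch only: "urls" / "url" are placeholder names.
WITH hosts AS (
    SELECT
        url,
        split_part(url, '/', 3) AS host  -- steps 1-2: keep the text between the 2nd and 3rd '/'
    FROM urls
)
SELECT
    url,
    host,
    CASE
        -- steps 3 and 5: more than one '.', so keep only the text after the first '.'
        WHEN length(host) - length(replace(host, '.', '')) > 1
            THEN substring(host FROM position('.' IN host) + 1)
        -- step 4: exactly one '.', so the host is already just domain.tld
        -- (split_part has already removed the 'http://' prefix)
        ELSE host
    END AS top_level_domain
FROM hosts;
```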
I have been able to find a few examples of similar approaches:

- http://www.postgresql.org/message-id/247444.36947.qm@web50311.mail.re2.yahoo.com
- http://www.seanbehan.com/extract-domain-names-from-links-in-text-with-postgres-and-a-single-sql-query
But neither of these seems to work correctly: I am querying an Amazon Redshift database, and I get a 'Function not Implemented' error when I attempt to execute them.
While there are ways to do this in Python or other languages, I haven't been able to find a Stack Overflow solution specifically for PostgreSQL.