
I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in python?

Gavin Schulz

9 Answers


Getting the hostname is easy enough using urlparse (on Python 3, the module is urllib.parse):

hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname

Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.

One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and private domains which are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library:

import publicsuffix
import urlparse  # Python 2; on Python 3, use urllib.parse instead

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself.  Make sure you
    # update frequently, though!
    psl = publicsuffix.fetch()

    hostname = urlparse.urlparse(url).hostname

    return publicsuffix.get_public_suffix(hostname, psl)
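
A quick usage sketch, assuming the fetch and lookup behave as described above (and swapping the import for urllib.parse on Python 3):

print(get_base_domain("http://www.theregister.co.uk/"))
# theregunister.co.uk would be wrong; this prints: theregister.co.uk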
Jack
Ben Blank
  • Please can you explain how this code hostname = ".".join(len(hostname[-2]) < 4 and hostname[-3:] or hostname[-2:]) works? Thanks – Joozty Aug 23 '17 at 19:31
  • @Joozty — Negative indices start from the end, so `hostname[-2]` means the next-to-last entry (in this case, the hostname split by dots). `foo and bar or baz` works much like a ternary: if "foo" is true, return "bar"; otherwise, return "baz". Finally, `hostname[-3:]` means the last three parts. All together, this means "If the next-to-last part of the hostname is shorter than four characters, use the last three parts and join them together with dots. Otherwise, take only the last two parts and join them together." – Ben Blank Aug 23 '17 at 21:58
  • 1
    For some reason, even after installing the module, on Python 3 I get `ImportError: cannot import name 'get_public_suffix'`. Couldn't find any answer online or in the documentation, so just used "tldextract" instead which just works! Of course, I had to `sudo pip3 install tldextract` first. – Nagev Feb 02 '18 at 17:00
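For anyone puzzling over the and/or idiom from the comments above, here is a minimal standalone demo (the hostname value is just an example):

hostname = "www.theregister.co.uk".split(".")
print(len(hostname[-2]) < 4 and hostname[-3:] or hostname[-2:])
# ['theregister', 'co', 'uk'] -- 'co' is shorter than four characters
print(".".join(len(hostname[-2]) < 4 and hostname[-3:] or hostname[-2:]))
# theregister.co.uk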

General structure of URL:

scheme://netloc/path;parameters?query#fragment
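
urlparse splits a URL into exactly those six components (this mirrors the example in the Python docs):

>>> from urllib.parse import urlparse
>>> urlparse('scheme://netloc/path;parameters?query#fragment')
ParseResult(scheme='scheme', netloc='netloc', path='/path', params='parameters', query='query', fragment='fragment')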

In the spirit of the TIMTOWTDI motto (there's more than one way to do it):

Using urlparse,

>>> from urllib.parse import urlparse  # python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '')  # as per your case
>>> print(result)
stackoverflow.com/

Using tldextract,

>>> import tldextract  # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

in your case:

>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'

tldextract, on the other hand, knows what all gTLDs (generic top-level domains) and ccTLDs (country-code top-level domains) look like by looking up the currently live ones in the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
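
If your version of tldextract provides it (recent releases do), ExtractResult also exposes a registered_domain convenience property that joins domain and suffix for you:

>>> tldextract.extract('http://www.techcrunch.com/').registered_domain
'techcrunch.com'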

Cheerio! :)

Mohsin Aljiwala

The following script is not perfect, but can be used for display/shortening purposes. If you really want or need to avoid any third-party dependencies (especially remotely fetching and caching TLD data), I can suggest the following script, which I use in my projects. It uses the last two parts of the domain for the most common domain extensions, and the last three parts for the rest of the less well-known extensions. In the worst-case scenario, the domain will have three parts instead of two:

from urlparse import urlparse  # Python 2; on Python 3: from urllib.parse import urlparse

def extract_domain(url):
    parsed_domain = urlparse(url)
    domain = parsed_domain.netloc or parsed_domain.path # Just in case, for urls without scheme
    domain_parts = domain.split('.')
    if len(domain_parts) > 2:
        return '.'.join(domain_parts[-(2 if domain_parts[-1] in {
            'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'} else 3):])
    return domain

extract_domain('google.com')          # google.com
extract_domain('www.google.com')      # google.com
extract_domain('sub.sub2.google.com') # google.com
extract_domain('google.co.uk')        # google.co.uk
extract_domain('sub.google.co.uk')    # google.co.uk
extract_domain('www.google.com')      # google.com
extract_domain('sub.sub2.voila.fr')   # sub2.voila.fr
darklow
  • 2,249
  • 24
  • 22

Using Python 3.3 (not 2.x):

I would like to add a small thing to Ben Blank's answer.

from urllib.parse import unquote, urlparse

u = 'http://twitter.co.uk/hello/there'  # example URL
u = unquote(u)
g = urlparse(u)
u = g.netloc

At this point, we have just the domain name (netloc) from urlparse.

To remove the subdomains, you first of all need to know which parts are top-level domains and which are not. E.g. in http://twitter.co.uk above, co.uk is a TLD, while in http://sub.twitter.com we have only .com as the TLD and sub as a subdomain.

So, we need a file/list which has all the TLDs.

tlds = load_file("tlds.txt")  # tlds holds the list of TLDs, uppercase (see the load_file sketch after the snippet below)

hostname = u.split(".")
if len(hostname) > 2:
    if hostname[-2].upper() in tlds:
        # e.g. twitter.co.uk: 'CO' is in the TLD list, so keep three parts
        hostname = ".".join(hostname[-3:])
    else:
        hostname = ".".join(hostname[-2:])
else:
    hostname = ".".join(hostname[-2:])
azam
from urllib.parse import urlsplit

def get_domain(url):
    u = urlsplit(url)
    return u.netloc

def get_top_domain(url):
    u"""
    >>> get_top_domain('http://www.google.com')
    'google.com'
    >>> get_top_domain('http://www.sina.com.cn')
    'sina.com.cn'
    >>> get_top_domain('http://bbc.co.uk')
    'bbc.co.uk'
    >>> get_top_domain('http://mail.cs.buaa.edu.cn')
    'buaa.edu.cn'
    """
    domain = get_domain(url)
    domain_parts = domain.split('.')
    if len(domain_parts) < 2:
        return domain
    top_domain_parts = 2
    # if a domain's last part is 2 letters long, it must be a country code
    if len(domain_parts[-1]) == 2:
        if domain_parts[-1] in ['uk', 'jp']:
            if domain_parts[-2] in ['co', 'ac', 'me', 'gov', 'org', 'net']:
                top_domain_parts = 3
        else:
            if domain_parts[-2] in ['com', 'org', 'net', 'edu', 'gov']:
                top_domain_parts = 3
    return '.'.join(domain_parts[-top_domain_parts:])
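
Since the docstring doubles as doctests, a quick way to verify the examples above:

if __name__ == '__main__':
    import doctest
    doctest.testmod()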
ospider

You don't need a package, or any of the complexities people are suggesting, to do this; it's as simple as the snippet below, and you can tweak it to your liking.

def is_root(url):
    head, sep, tail = url.partition('//')
    is_root_domain = tail.split('/', 1)[0] if '/' in tail else tail  # tail (not url), so a path-less URL yields just the host
    # printing or returning is_root_domain will give you what you seek
    print(is_root_domain)

is_root('http://www.techcrunch.com/')  # prints www.techcrunch.com
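
A URL without any path works too, since partition leaves the whole host in tail (example.com here is just a placeholder):

is_root('http://example.com')  # prints example.com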
Jason Martinez

This worked for me:

from urllib.parse import urlparse

def get_sub_domains(url):
    urlp = urlparse(url)
    urlsplit = urlp.netloc.split(".")
    l = []
    if len(urlsplit) < 3: return l
    for item in urlsplit:
        # drop the leading label and record what remains each pass
        urlsplit = urlsplit[1:]
        l.append(".".join(urlsplit))
        if len(urlsplit) < 3:
            return l
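
For example:

>>> get_sub_domains('http://forums.news.cnn.com/')
['news.cnn.com', 'cnn.com']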
Hg0428

This simple code will get the root URL (scheme plus hostname) from any valid URL.

from urllib.parse import urlparse

url = 'https://www.google.com/search?q=python'
parsed = urlparse(url)
root_url = parsed.scheme + '://' + parsed.hostname
print(root_url)  # https://www.google.com
Praveen Kumar

This worked for my purposes. I figured I'd share it.

".".join("www.sun.google.com".split(".")[-2:])
Joe J