I have a couple of websites that are hosted as subdomains of blogging platforms (e.g., WordPress, Altervista, Blogpress, ...).

I'm currently using `urlparse` to split URLs into their components. However, it does not seem to distinguish subdomains; at best I can get at the TLD.

Alternatively, I could build a vocabulary of all the blog-platform suffixes and assign 1 or 0 based on whether a URL matches one. But since I don't know all the blog platforms, I'm wondering if there is a way to do the detection automatically.
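A minimal sketch of the vocabulary idea, to make the limitation concrete. The set `BLOG_SUFFIXES` and the helper `is_blog_subdomain` are hypothetical names chosen for illustration, and the set is deliberately incomplete: any platform missing from it is silently scored 0.

```python
from urllib.parse import urlparse

# Hypothetical, hand-maintained vocabulary of blog-platform suffixes (incomplete).
BLOG_SUFFIXES = {"wordpress.com", "altervista.org", "blogspot.com"}


def is_blog_subdomain(url: str) -> int:
    """Return 1 if the host sits under a known blog suffix, else 0."""
    # urlparse only recognises the host when a scheme (or a leading //) is present.
    if "://" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()
    # Require a label in front of the suffix, so the platform's own site scores 0.
    return int(any(host.endswith("." + suffix) for suffix in BLOG_SUFFIXES))


for u in ["https://artgateblog.altervista.org/",
          "12story.altervista.org",
          "https://www.example.com/"]:
    print(u, is_blog_subdomain(u))
```

This works only for the platforms listed, which is why an automatically maintained suffix list would be preferable.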

For example, I was thinking of counting the dots in the hostname, but many websites have extra dots without being subdomains of a blog platform, so this approach is not reliable.
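To illustrate why dot counting falls short, here is a small sketch using only the standard library. The helper name `looks_like_subdomain_by_dots` and the `example.co.uk` URL are assumptions made up for the illustration.

```python
from urllib.parse import urlparse


def looks_like_subdomain_by_dots(url: str) -> bool:
    """Naive heuristic: more than one dot in the host means 'subdomain'."""
    host = urlparse(url).hostname or ""
    return host.count(".") > 1


# A blog hosted on a platform: two dots, flagged as intended.
print(looks_like_subdomain_by_dots("https://artgateblog.altervista.org/"))
# An ordinary site under a two-part suffix: also two dots, flagged by mistake.
print(looks_like_subdomain_by_dots("https://example.co.uk/"))
```

Both hosts contain two dots, so the heuristic cannot tell them apart without knowing which suffixes count as a single unit.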

martineau
LdM

1 Answer

I think the `tld` library should do the trick: https://pypi.org/project/tld/.

Here's an example:

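from tld import get_tld

url = "https://artgateblog.altervista.org/"
# as_object=True returns a result object rather than a plain string
res = get_tld(url, as_object=True)
# res.domain is the blog name; res.tld is the suffix it sits under
blogname, blog_domain = res.domain, res.tld
print(blogname, blog_domain)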

Out:

artgateblog altervista.org

EDIT after comments:

For URLs that don't include a protocol, I think you need to have it added by passing `fix_protocol=True`, with something like the below:

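from tld import get_tld

urls = ["12story.altervista.org", "fantasy_story.blogspot.com"]
for url in urls:
    # fix_protocol=True lets the library handle URLs that lack a protocol
    res = get_tld(url, as_object=True, fix_protocol=True)
    blogname, blog_domain = res.domain, res.tld
    print(blogname, blog_domain)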
osint_alex
  • thanks osint_alex. I think get_tld can't be applied in case the protocol is missing. I am getting the message `TldBadUrl: Is not a valid URL` for some URLs – LdM Aug 15 '21 at 23:51
  • Can you give an example of those urls? – osint_alex Aug 15 '21 at 23:51
  • An example is `12story.altervista.org` or `fantasy_story.blogspot.com`. Both the protocol and www are missing. I've tried `try: res = get_tld(url, as_object=True) except: res = get_tld(url, fix_protocol=True)`, but it gives the error `TldDomainNotFound: Domain 12story didn't match any existing TLD name!` – LdM Aug 15 '21 at 23:55
  • @osint_alex You don't have to do that. There is a `fix_protocol` argument that automatically fixes the URL for you. FYI `print(get_tld("12story.altervista.org", fix_protocol=True))` works fine for me. – Selcuk Aug 16 '21 at 00:02