I've been looking for a way to parse the domain from a URL. There are plenty of libraries, but I haven't found a complete one. I'm currently using urllib.parse, which returns nothing when parsing a domain with a dash (-) in it. Are there other options I should consider using?

Example:

from urllib.parse import urlparse

print(urlparse("www.bax-shop.nl/muziekwinkel-goes").netloc)

Output:



Process finished with exit code 0

Edit: It seems to work with https:// in front of the URL, which I find a bit strange.

  • If you provide the scheme/protocol, it will parse fine: `urlparse("http://www.bax-shop.nl/muziekwinkel-goes")` – Chris Doyle Mar 24 '21 at 16:40
  • @ChrisDoyle the problem is that it's parsing a few million URLs, with and without https. Is there some sort of parameter I can add? – Studentdev Mar 24 '21 at 16:51
  • Well then they are not really URLs. URLs have a specification `:`, so if you have data which doesn't have a scheme like `http://`, `https://` or `ftp://`, then you won't be able to parse it with urlparse, since those strings are not valid URLs. You could just add some code that checks whether the string has a scheme and, if it doesn't, prepends `http://` before giving it to urlparse – Chris Doyle Mar 24 '21 at 16:55

1 Answer

As others have already stated in the comments, every URL should begin with a scheme, most likely http or https in your case. There is nothing strange about that: the scheme is essential for URL parsers to understand what they should do (which protocol to use to connect to the address). Of course, you could write a parser that accepts a URL-like string (again, not a real URL, because a URL can't go without the first part) and extracts the domain name from it.
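
To see what actually happens, here is a small check using only `urlparse` on the URL from the question. Note that the dash is not the cause: without a scheme (or a leading `//`), `urlparse` has no way to tell where the host ends, so it puts the whole string into `path` and leaves `netloc` empty.

```python
from urllib.parse import urlparse

# Same address, once without and once with a scheme.
without_scheme = urlparse("www.bax-shop.nl/muziekwinkel-goes")
with_scheme = urlparse("http://www.bax-shop.nl/muziekwinkel-goes")

# Without a scheme, everything ends up in `path` and `netloc` stays empty.
print(repr(without_scheme.netloc))
print(repr(without_scheme.path))

# With a scheme, the host is recognised as the network location.
print(repr(with_scheme.netloc))
print(repr(with_scheme.path))
```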

In your case I would do something like this:

from urllib.parse import urlparse


def get_domain_name(url):
    if '://' not in url:
        # Making `http` the default protocol so that urllib handles url correctly
        url = 'http://' + url

    return urlparse(url).netloc


if __name__ == "__main__":
    print(get_domain_name("https://stackoverflow.com/"))
    print(get_domain_name("www.bax-shop.nl/muziekwinkel-goes"))
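
Since you mention a few million mixed inputs, a slightly stricter variant may be worth considering. This is only a sketch under my own assumptions: the name `get_hostname` and the regular expression are made up for illustration, and `http` is still just an arbitrary default. It only treats `://` as a scheme separator when it follows a scheme-like prefix at the very start of the string, it handles scheme-relative input starting with `//`, and it uses `.hostname` instead of `.netloc`, which drops the port and any user info and lowercases the host.

```python
import re
from urllib.parse import urlparse

# Hypothetical helper pattern: a scheme-like prefix ("http://", "ftp://", ...)
# at the very start of the string.
_SCHEME_PREFIX = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def get_hostname(url):
    """Return the lowercased host of a URL-like string, or '' if none is found."""
    url = url.strip()
    if url.startswith("//"):
        # Scheme-relative input: only the scheme name is missing.
        url = "http:" + url
    elif not _SCHEME_PREFIX.match(url):
        # No scheme at the start: assume http so that urlparse sees a host.
        url = "http://" + url
    # .hostname is None when no host could be parsed.
    return urlparse(url).hostname or ""


if __name__ == "__main__":
    print(get_hostname("www.bax-shop.nl/muziekwinkel-goes"))
    print(get_hostname("Example.COM:8080/path"))
    print(get_hostname("example.com/?next=http://foo"))
```

The start-anchored check matters for inputs like the last one, where `://` appears only inside the query string.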
Kolay.Ne