7

This simple code confuses urlparse: it does not extract the hostname and returns None instead:

from urllib.parse import urlparse
parsed = urlparse("google.com/foo?bar=8")
print(parsed.hostname)

Am I missing something?

user1618465

4 Answers

5

According to https://www.rfc-editor.org/rfc/rfc1738#section-2.1:

Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http").

Using advice given in previous answers, I wrote this helper function which can be used in place of urllib.parse.urlparse():

#!/usr/bin/env python3
import re
import urllib.parse

def urlparse(address):
    if not re.search(r'^[A-Za-z0-9+.\-]+://', address):
        address = 'tcp://{0}'.format(address)
    return urllib.parse.urlparse(address)

url = urlparse('localhost:1234')
print(url.hostname, url.port)

A previous version of this function called urllib.parse.urlparse(address), and then prepended the "tcp" scheme if one wasn't found; but this interprets the username as the scheme if you pass it something like "user:pass@localhost:1234".
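The pitfall described above can be reproduced directly. This is a minimal sketch that only uses urllib.parse.urlparse and the "tcp" placeholder scheme from the helper; the variable names naive and fixed are just for illustration.

```python
from urllib.parse import urlparse

# Without a "//" marker, the text before the first ":" is taken as the
# scheme when it looks like a valid scheme name.
naive = urlparse("user:pass@localhost:1234")
print(naive.scheme)    # 'user'
print(naive.hostname)  # None

# With an explicit scheme, userinfo, host, and port are split as intended.
fixed = urlparse("tcp://user:pass@localhost:1234")
print(fixed.username, fixed.password, fixed.hostname, fixed.port)
```

This is why the helper checks for a leading "scheme://" with a regular expression before parsing, rather than inspecting the parse result afterwards.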

Huw Walters
3

google.com/foo?bar=8 is a relative URL, i.e. a "path" with a "query". You may read google.com as a hostname, but it doesn't have to be one (and how would Python know?).

URLs consist of protocol or scheme ('https:', 'ftp:', etc.), host ('//example.com'), path, query, fragment.

So urlparse makes its best guess: it treats the whole string as a path, leaves the scheme and netloc empty, and therefore reports None for the hostname.
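To illustrate the point about '//' introducing the host, here is a small sketch comparing the original string with the same string prefixed by "//". The variable names relative and network are only for illustration.

```python
from urllib.parse import urlparse

# No "//" marker: the whole string up to "?" is parsed as a path.
relative = urlparse("google.com/foo?bar=8")
print(repr(relative.netloc), relative.path)  # '' google.com/foo

# A leading "//" tells urlparse that a host follows, even without a scheme.
network = urlparse("//google.com/foo?bar=8")
print(network.hostname, network.path)        # google.com /foo
```

Prepending "//" is one way to get a hostname without inventing a scheme, provided you already know the input starts with a host.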

pbuck
2

Just to add some further context to Muadh's answer. Look at the output from these two variations using urlparse:

>>> parsed = urlparse("google.com/foo?bar=8")
>>> parsed
ParseResult(scheme='', 
            netloc='', 
            path='google.com/foo', 
            params='', 
            query='bar=8', 
            fragment='')

And with the scheme specified:

>>> parsed = urlparse("http://google.com/foo?bar=8")
>>> parsed
ParseResult(scheme='http', 
            netloc='google.com', 
            path='/foo', 
            params='', 
            query='bar=8', 
            fragment='')
ScottMcC
0

For this to work properly you have to include the protocol identifier (http://). This is what worked for me:

parsed = urlparse("https://www.google.com/foo?bar=8")
print(parsed.hostname)

The output from here was: www.google.com (which seems expected). More can be read about how to use urlparse here.

Hope this helps you out!

mmghu
  • Yes, I know that works, but what if I have a list that does not specify the scheme? Actually it's not that weird that they don't – user1618465 May 24 '18 at 00:40
  • Your link to `urlparse` references Python 2. `urlparse` is imported differently in Python 3 – ScottMcC May 24 '18 at 00:41
  • In this case I would check and see if it has a protocol identifier, and if not add it to the string before parsing. – mmghu May 24 '18 at 00:41
  • @ScottMcC Ah thanks for catching that, I fixed the link. – mmghu May 24 '18 at 00:43
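
The check suggested in the comments (look for a protocol identifier and add one if it is missing) could be sketched for a list of addresses as follows. The helper name hostname_of, the HAS_SCHEME pattern, and the "http" default are assumptions made for this example, not part of urllib.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: matches a leading "scheme://" prefix.
HAS_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*://')

def hostname_of(address, default_scheme="http"):
    """Return the hostname, assuming default_scheme when none is given."""
    if not HAS_SCHEME.match(address):
        address = f"{default_scheme}://{address}"
    return urlparse(address).hostname

addresses = ["google.com/foo?bar=8", "https://www.google.com/foo?bar=8"]
hostnames = [hostname_of(a) for a in addresses]
print(hostnames)  # ['google.com', 'www.google.com']
```

The default scheme only matters for getting the host recognized; whether "http" is the right guess for your data is a separate decision.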