urljoin when an absolute path does not have a leading slash

Question

Some websites like http://www.gilacountyaz.gov/government/assessor/index.php have a bunch of internal links that should be absolute paths, but do not have the leading slash. When parsing them with urlparse.urljoin the result is the following:

>>> import urlparse
>>> a = "http://www.gilacountyaz.gov/government/assessor/index.php"
>>> b = "government/assessor/address_change.php"
>>> urlparse.urljoin(a, b)
'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'

As a result, a web crawler does not realize it has already visited a page and can end up in an infinite loop. Firefox and Chrome are able to spot the problem and resolve the link correctly to

http://www.gilacountyaz.gov/government/assessor/address_change.php

Is there a way to do the same in Python? Note that always assuming a leading slash does not work, because we might be dealing with a genuinely relative path.

How are you expecting to distinguish between a genuine relative path and an accidentally relative path that's missing a `/`? What's the rule behind that distinction? — abarnert, Nov 05 '14 at 19:42
I frankly do not know, but Firefox and Chrome are apparently doing it, so I wonder if there is a way. — Mikk, Nov 05 '14 at 19:45

score 8 · Accepted Answer · answered Nov 05 '14 at 19:45

The linked page contains the following:

<head>
  <base href="http://www.gilacountyaz.gov/index.php"/>
</head>

If you use that URL as the first argument to urljoin you'll get the correct result. This tag is what allows your browser to interpret these links correctly.
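For a crawler, the base URL would have to be read from each page rather than hard-coded. Here is a minimal sketch of one way to do that with only the standard library, written for Python 3 (where `urlparse` became `urllib.parse`). The names `BaseHrefParser` and `effective_base` are made up for illustration, and the HTML string is just the snippet quoted above, not the full page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefParser(HTMLParser):
    """Hypothetical helper: records the href of the first <base> tag seen."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <base ... />.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def effective_base(page_url, html):
    """Return the URL that relative links on the page should be resolved against."""
    parser = BaseHrefParser()
    parser.feed(html)
    if parser.base_href:
        # The <base> href may itself be relative, so resolve it against the page URL.
        return urljoin(page_url, parser.base_href)
    return page_url


page_url = "http://www.gilacountyaz.gov/government/assessor/index.php"
html = '<head><base href="http://www.gilacountyaz.gov/index.php"/></head>'

base = effective_base(page_url, html)
resolved = urljoin(base, "government/assessor/address_change.php")
print(resolved)
```

If a page has no `<base>` tag, the sketch falls back to the page's own URL, so genuinely relative links still resolve as before.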

score 4 · Answer 2 · answered Nov 05 '14 at 19:47

Firefox and Chrome are both reading the <base> tag at the top of the page:

<base href="http://www.gilacountyaz.gov/index.php"/>

Your code needs to use that as the root:

>>> import urlparse
>>> a = "http://www.gilacountyaz.gov/index.php"
>>> b = "government/assessor/address_change.php"
>>> urlparse.urljoin(a, b)
'http://www.gilacountyaz.gov/government/assessor/address_change.php'
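
To see why the choice of base matters, here is a small side-by-side sketch that resolves the same link against the page URL and against the `<base>` URL. The import fallback is only there so the snippet runs on both Python 2 and Python 3; the variable names are illustrative.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

page_url = "http://www.gilacountyaz.gov/government/assessor/index.php"
base_url = "http://www.gilacountyaz.gov/index.php"
link = "government/assessor/address_change.php"

# urljoin replaces everything after the last "/" of the base path with the link,
# so the directory part of the base decides where the link ends up.
from_page = urljoin(page_url, link)
from_base = urljoin(base_url, link)

print(from_page)
print(from_base)
```

The first result repeats `government/assessor/` as in the question, while the second matches what the browsers produce.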
