Some websites, such as http://www.gilacountyaz.gov/government/assessor/index.php, contain many internal links that are meant to be absolute paths but are missing the leading slash. Joining such a link to the page URL with urlparse.urljoin
gives the following:
>>> import urlparse
>>> a = "http://www.gilacountyaz.gov/government/assessor/index.php"
>>> b = "government/assessor/address_change.php"
>>> urlparse.urljoin(a, b)
'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'
This can stop a web crawler from recognizing that it has already visited a page, potentially sending it into an infinite loop. Firefox and Chrome are able to spot the problem and resolve the link correctly to
http://www.gilacountyaz.gov/government/assessor/address_change.php
Is there a way to do the same in Python? Note that always assuming a leading slash does not work, because the link might be a genuine relative path.
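
To illustrate why always assuming a leading slash fails, here is a minimal sketch. The helper join_as_root_relative is a hypothetical name used only for this illustration and is not part of urlparse. The second link, "address_change.php", is an assumed example of a genuine relative link and is not taken from the actual page.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the question

base = "http://www.gilacountyaz.gov/government/assessor/index.php"


def join_as_root_relative(base_url, href):
    # Hypothetical helper: always treat href as if it began with "/".
    return urljoin(base_url, "/" + href.lstrip("/"))


# The malformed link from the question now resolves the way the browsers do.
fixed = join_as_root_relative(base, "government/assessor/address_change.php")

# A genuine relative link (assumed example) is resolved against the site
# root instead of the current directory, so it points to the wrong place.
broken = join_as_root_relative(base, "address_change.php")

# What plain urljoin gives for the genuine relative link.
expected = urljoin(base, "address_change.php")

print(fixed)
print(broken)
print(expected)
```

The first URL is the one I want, but the second loses the /government/assessor/ directory that plain urljoin keeps in the third, so the naive fix trades one kind of wrong URL for another.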