3

Some websites, such as http://www.gilacountyaz.gov/government/assessor/index.php, have many internal links that are meant to be resolved from the site root but lack the leading slash. Resolving them with urlparse.urljoin against the page URL gives the following:

>>> import urlparse
>>> a = "http://www.gilacountyaz.gov/government/assessor/index.php"
>>> b = "government/assessor/address_change.php"
>>> urlparse.urljoin(a, b)
'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'

As a result, a web crawler does not realize it has already visited a page and can end up in a potentially infinite loop. Firefox and Chrome are able to spot the problem and resolve the link correctly to

http://www.gilacountyaz.gov/government/assessor/address_change.php

Is there a way to do the same in Python? Note that always assuming a leading slash does not work, because we might be dealing with a genuine relative path.

Mikk
  • How are you expecting to distinguish between a genuine relative path and an accidentally relative path that's missing a `/`? What's the rule behind that distinction? – abarnert Nov 05 '14 at 19:42
  • I frankly do not know, but Firefox and Chrome are apparently doing it, so I wonder if there is a way. – Mikk Nov 05 '14 at 19:45

2 Answers

8

The linked page contains the following:

<head>
  <base href="http://www.gilacountyaz.gov/index.php"/>
</head>

If you use that URL, rather than the page's own URL, as the first argument to urljoin, you'll get the correct result. This tag is what allows your browser to interpret these links correctly.
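A crawler could pick up the <base> tag automatically instead of hard-coding it. The following is only a rough sketch using the standard library's HTMLParser. The names BaseHrefParser and resolve_link are made up for illustration, and it assumes the first <base> tag that carries an href is the one that counts. A relative base href is itself joined against the page URL first.

```python
try:
    from html.parser import HTMLParser      # Python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser       # Python 2
    from urlparse import urljoin


class BaseHrefParser(HTMLParser):
    """Hypothetical helper: remembers the href of the first <base> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <base href="..."/>.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def resolve_link(page_url, html, link):
    """Resolve link against the page's <base href>, else the page URL."""
    parser = BaseHrefParser()
    parser.feed(html)
    base = page_url
    if parser.base_href:
        # Assumption: a relative base href is resolved against the page URL.
        base = urljoin(page_url, parser.base_href)
    return urljoin(base, link)
```

With the page from the question:

```python
page = "http://www.gilacountyaz.gov/government/assessor/index.php"
html = '<head><base href="http://www.gilacountyaz.gov/index.php"/></head>'
print(resolve_link(page, html, "government/assessor/address_change.php"))
# http://www.gilacountyaz.gov/government/assessor/address_change.php
```

If the page has no <base> tag, this falls back to plain urljoin against the page URL, so genuine relative links keep working.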

Dan Rice
4

Firefox and Chrome are both reading the <base> tag at the top of the page:

<base href="http://www.gilacountyaz.gov/index.php"/>

Your code needs to use that URL as the base when joining:

>>> import urlparse
>>> a = "http://www.gilacountyaz.gov/index.php"
>>> b = "government/assessor/address_change.php"
>>> urlparse.urljoin(a, b)
'http://www.gilacountyaz.gov/government/assessor/address_change.php'
Brent Washburne