I need to convert relative URLs from an HTML page to absolute ones. I'm using pyquery for parsing.
For instance, this page http://govp.info/o-gorode/gorozhane has relative URLs in the source code, like
<a href="o-gorode/gorozhane?page=2">2</a>
(this is the pagination link at the bottom of the page). I'm trying to use make_links_absolute():
import requests
from pyquery import PyQuery as pq
page_url = 'http://govp.info/o-gorode/gorozhane'
resp = requests.get(page_url)
page = pq(resp.text)
page.make_links_absolute(page_url)
but it seems that this breaks the relative links:
print(page.find('a[href*="?page=2"]').attr['href'])
# prints http://govp.info/o-gorode/o-gorode/gorozhane?page=2
# expected value http://govp.info/o-gorode/gorozhane?page=2
As you can see, o-gorode is doubled in the middle of the final URL, which will definitely produce a 404 error.
Internally, pyquery uses urljoin from the standard urllib.parse module, somewhat like this:
from urllib.parse import urljoin
urljoin('http://example.com/one/', 'two')
# -> 'http://example.com/one/two'
That's fine, but a lot of sites have, hmm, unusual relative links that contain the full path. In this case urljoin gives us an invalid absolute link:
urljoin('http://govp.info/o-gorode/gorozhane', 'o-gorode/gorozhane?page=2')
# -> 'http://govp.info/o-gorode/o-gorode/gorozhane?page=2'
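To make the behaviour easier to see, here is a small sketch of how urljoin resolves the same relative link against different base URLs. As far as I understand the RFC 3986 rules, the last path segment of the base is dropped and the relative path is appended to what remains. The site-root base and the leading-slash variant of the link are my own illustrations, not something taken from the page.

```python
from urllib.parse import urljoin

page_url = 'http://govp.info/o-gorode/gorozhane'
link = 'o-gorode/gorozhane?page=2'

# Base is the page URL: the last segment ("gorozhane") is dropped,
# then the relative path is appended to "/o-gorode/".
joined_with_page = urljoin(page_url, link)

# Base is the site root (illustrative): nothing to drop, so the path is kept as is.
joined_with_root = urljoin('http://govp.info/', link)

# A root-relative link (illustrative: leading slash) ignores the base path entirely.
joined_root_relative = urljoin(page_url, '/' + link)

print(joined_with_page)
print(joined_with_root)
print(joined_root_relative)
```

So the join itself is doing what the spec says; the question is which base URL the link is supposed to be resolved against.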
I believe such relative links are not quite valid, but Google Chrome has no problem dealing with them, so I guess this is fairly normal across the web.
Is there any advice on how to solve this problem? I tried furl, but it does the same join.
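One thing that might be worth checking: browsers resolve relative links against the document's base URL, which a base element in the head can override. I have not verified that this page declares one, so the sample HTML below is purely hypothetical, and BaseHrefFinder and effective_base_url are names made up for this sketch. It uses only the standard library and falls back to the page URL when no base href is found.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> element, if there is one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == 'base' and self.base_href is None:
            self.base_href = dict(attrs).get('href')


def effective_base_url(page_url, html):
    """Return the URL that relative links should be resolved against."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        # The base href may itself be relative, so resolve it against the page URL.
        return urljoin(page_url, finder.base_href)
    return page_url


# Hypothetical markup: the <base> tag here is an assumption, not copied from the real page.
sample_html = (
    '<html><head><base href="http://govp.info/"></head>'
    '<body><a href="o-gorode/gorozhane?page=2">2</a></body></html>'
)
page_url = 'http://govp.info/o-gorode/gorozhane'

base = effective_base_url(page_url, sample_html)
resolved = urljoin(base, 'o-gorode/gorozhane?page=2')
print(resolved)
```

If the real page does turn out to have such a tag, the value returned by effective_base_url could be passed to make_links_absolute() in place of page_url.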