I have to parse an HTML page looking for the links in it. Unfortunately, the links don't contain the full URL (for instance, one starting with "http://www.example.com/aResource.html"), so my parsing only yields relative URLs. To build the whole URL I'm using
urlparse.urljoin()
but it often leads to connection errors, and in general I would prefer a direct way to extract the complete URLs. Here is my code:
    import urlparse
    import requests
    from lxml import html

    aFile = requests.get(url)  # url holds the address of the page being scraped
    tree = html.fromstring(aFile.text)
    linkList = tree.xpath('//a')

    urls = []
    for link in linkList:
        urls.append(urlparse.urljoin(url, link.get('href')))
As you can see, I'm working with lxml, but I've also tried BeautifulSoup without success.
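For what it's worth, here is a minimal, self-contained sketch of the kind of thing I mean by a "direct way": lxml's tree objects have a make_links_absolute() method that rewrites every href in place against a base URL, so no per-link urljoin call is needed. The HTML string and the base URL below are made up for illustration; in my real code the document comes from requests.

    from lxml import html

    # A stand-in for the fetched page (hypothetical content)
    doc = """<html><body>
    <a href="aResource.html">one</a>
    <a href="/other.html">two</a>
    </body></html>"""

    tree = html.fromstring(doc)
    # Rewrite all links in the tree to be absolute, resolved
    # against the given base URL (illustrative value)
    tree.make_links_absolute("http://www.example.com/")

    urls = [link.get('href') for link in tree.xpath('//a')]
    print(urls)
    # ['http://www.example.com/aResource.html', 'http://www.example.com/other.html']

If the relative links themselves are fine and the errors come from somewhere else, this won't fix them, but it does remove the explicit urljoin step.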