
I have to parse an HTML page looking for links in it. Unfortunately, the links don't contain the full URL (for instance, one starting with "http://www.example.com/aResource.html"), so my parsing gets only the relative URL. To get the whole URL address I'm using

urlparse.urljoin()

But it often leads to connection errors, and in general I would prefer a more direct way to extract the complete link URLs. Here is my code:

import urlparse
import requests
from lxml import html
from lxml import etree

aFile = requests.get(url)
tree = html.fromstring(aFile.text)

linkList = tree.xpath('//a')

urls = []

for link in linkList:
    urls.append(str(urlparse.urljoin(url,link.get('href'))))

As you can see I'm working with lxml, but I've also tried BeautifulSoup without success.

accand
  • Possible duplicate: http://stackoverflow.com/questions/717541/parsing-html-in-python?rq=1 – nchen24 Dec 10 '14 at 10:16
  • @PadraicCunningham The url is something like that: http://example.com/path/0VPZUJL06JKS/U09R71.html. And in the link tag are specified just the element from the last "/" – accand Dec 10 '14 at 10:27
  • @user2567853 You mean that the schema (http://) is missing ? – Cld Dec 10 '14 at 10:36
  • @Cld I mean that this part is missing: "http://example.com/path/0VPZUJL06JKS/" – accand Dec 10 '14 at 10:41
  • And this part is not in your "main" URL? In that case the problem is not in the code but in the page, whose links wouldn't even work in a browser... – Cld Dec 10 '14 at 10:46

1 Answer


Since the information (URL scheme, host server, port, path - base URL) is missing in <a href=""...>, it needs to be added to the relative URL.

Usually it is correct to use urlparse.urljoin() as you are already doing.
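For reference, a minimal sketch of how urljoin() resolves the kind of relative href mentioned in the comments (the base URL below is made up for illustration; in Python 3 the same function lives in urllib.parse):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# hypothetical base URL, shaped like the one in the comments
base = "http://example.com/path/0VPZUJL06JKS/U09R71.html"

# a bare filename replaces the last path segment of the base
print(urljoin(base, "U09R72.html"))
# -> http://example.com/path/0VPZUJL06JKS/U09R72.html

# a root-relative href keeps only the scheme and host
print(urljoin(base, "/other.html"))
# -> http://example.com/other.html

# an absolute href is returned unchanged
print(urljoin(base, "http://other.example.com/x.html"))
# -> http://other.example.com/x.html
```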

HTML does allow specifying a base URL for the page using the <base href="..."> tag, which must be defined at most once, in <head>. If this tag is present you should use its href attribute as the base URL for urljoin(). Your code could be revised to this:

import urlparse
import requests
from lxml import html
from lxml import etree

aFile = requests.get(url)
tree = html.fromstring(aFile.text)

linkList = tree.xpath('//a')

urls = []

try:
    # use the page's <base href="..."> if one is present
    base_url = tree.xpath('//base[1]/@href')[0]
except IndexError:
    # otherwise fall back to the URL the page was fetched from
    base_url = url

for link in linkList:
    urls.append(str(urlparse.urljoin(base_url,link.get('href'))))

However, if you are getting connection errors, it would appear that some of the links are invalid. The base URL, whether derived from the page's own URL or from the <base href="..."> tag, should be correct, so any invalid URL constructed from it must come from an invalid relative URL (or an invalid <base> tag).

Do you have concrete examples of the URLs being used when the connection errors occur?
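One way to narrow the problem down is to sanity-check each joined URL before requesting it. Here is a sketch; looks_valid is a hypothetical helper, not part of any library, and the base URL is assumed for illustration:

```python
try:
    from urllib.parse import urlparse, urljoin  # Python 3
except ImportError:
    from urlparse import urlparse, urljoin      # Python 2

def looks_valid(u):
    # hypothetical check: a fully resolved URL should have at least
    # an http(s) scheme and a host before we try to fetch it
    parts = urlparse(u)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

base_url = "http://example.com/path/"  # assumed base for illustration
print(looks_valid(urljoin(base_url, "a.html")))  # True
print(looks_valid("aResource.html"))             # False - still relative
```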

You could also look at mechanize:

import mechanize

br = mechanize.Browser()
resp = br.open(url)
# br.links() yields Link objects whose absolute_url is already
# resolved against the page's URL (and its <base> tag, if any)
urls = [link.absolute_url for link in br.links()]
mhawke