I would like to parse a web page in order to retrieve some information about it (my exact problem is to retrieve all the items in this list : http://www.computerhope.com/vdef.htm).
However, I can't figure out how to do it.
A lot of tutorials on the internet start with this (simplified) :
html5lib.parse(urlopen("http://www.computerhope.com/vdef.htm"))
But after that, none of the tutorials explain how I can browse the document and go the html part I am looking for.
Some other tutorials explain how to do it with CSSSelector
but again, all the tutorials don't start with a web page but with a string instead (e.g. here : http://lxml.de/cssselect.html).
So I tried to create a tree with the web page using this :
fromstring(urlopen("http://www.computerhope.com/vdef.htm").read())
but I got this error :
lxml.etree.XMLSyntaxError: Specification mandate value for attribute itemscope, line 3, column 28
. This error is due to the fact that there is an attribute that is not specified (e.g. <input attribute></input>
) but as I don't control the webpage, I can't go around it.
So here are a few questions that could solve my problems :
- How can I browse a tree ?
- Is there a way to make the parser less strict ?
Thank you !