I would like to parse a web page to extract some information from it (specifically, all the items in this list: http://www.computerhope.com/vdef.htm).

However, I can't figure out how to do it.

A lot of tutorials on the internet start with this (simplified): html5lib.parse(urlopen("http://www.computerhope.com/vdef.htm"))

But after that, none of the tutorials explain how to navigate the document and reach the part of the HTML I am looking for.

Other tutorials explain how to do it with CSSSelector, but they all start from a string rather than a web page (e.g. here: http://lxml.de/cssselect.html).

So I tried to build a tree from the web page with fromstring(urlopen("http://www.computerhope.com/vdef.htm").read()), but I got this error: lxml.etree.XMLSyntaxError: Specification mandate value for attribute itemscope, line 3, column 28. The error occurs because the page contains an attribute with no value (e.g. <input attribute></input>), but since I don't control the web page, I can't work around it.
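
The error can be reproduced offline with the standard library alone, which suggests it comes from feeding HTML to a strict XML parser rather than from anything unusual in the page. The sketch below is only an illustration, not the code from the question: the `SNIPPET` string, the `parses_as_xml` helper and the `AttrCollector` class are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical stand-in for the page: `itemscope` has no value,
# which is legal HTML5 but not well-formed XML.
SNIPPET = (
    '<html itemscope itemtype="http://schema.org/WebPage">'
    '<body><a href="/x">x</a></body></html>'
)


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class AttrCollector(HTMLParser):
    """Lenient HTML parser that records every (tag, attributes) pair it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed(SNIPPET)
collector.close()

print(parses_as_xml(SNIPPET))  # the XML parser rejects the valueless attribute
print(collector.seen[0])       # the HTML parser reports it with the value None
```

An HTML-aware parser treats a valueless attribute as an ordinary boolean attribute, so switching from an XML entry point to an HTML one is usually enough; the page itself does not need to change.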

So here are a few questions that could solve my problem:

  • How can I browse a tree?
  • Is there a way to make the parser less strict?

Thank you!

clementescolano

1 Answer

Try using Beautiful Soup. It tolerates malformed HTML and makes parsing in Python very easy.

Check out its documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

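from bs4 import BeautifulSoup
import requests

# Download the page and parse it with Python's built-in, lenient HTML parser
page = requests.get('http://www.computerhope.com/vdef.htm')
soup = BeautifulSoup(page.text, 'html.parser')

# The list lives in the first table on the page; print the text of each link in it
tables = soup.find_all('table')
for link in tables[0].find_all('a'):
    print(link.text)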

It prints all the items in the list. I hope the OP will adjust it as needed.
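
For readers who prefer to stay with lxml, as in the question, here is a minimal sketch of the same idea. It assumes that `lxml.html.fromstring` (the HTML entry point) is used instead of `lxml.etree.fromstring` (the XML one). The `PAGE` bytes are a made-up stand-in for the downloaded document, and the item names and hrefs in it are placeholders, not the real content of the site.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page. With the real site this would
# be the bytes returned by urlopen(...).read() or requests.get(...).content.
PAGE = b"""<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/WebPage">
<body>
<table>
<tr><td><a href="/first.htm">First term</a></td></tr>
<tr><td><a href="/second.htm">Second term</a></td></tr>
</table>
</body>
</html>"""

# lxml.html.fromstring uses a forgiving HTML parser, so the valueless
# `itemscope` attribute does not raise XMLSyntaxError here.
tree = html.fromstring(PAGE)

# Browse the tree with XPath: every link inside any table.
links = tree.xpath('//table//a')
items = [(a.text_content().strip(), a.get('href')) for a in links]

# Elements can also be walked directly, e.g. every <a> in document order.
hrefs = [a.get('href') for a in tree.iter('a')]

print(items)
print(hrefs)
```

The XPath expression plays the same role as `tables[0].find_all('a')` in the Beautiful Soup version, except that it matches links in every table rather than only the first; which of the two libraries to use is largely a matter of taste.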

miken32
Bharat
  • Can I know why the downvote? If you do not like the library, that does not mean my answer is wrong; it just means our opinions differ. – Bharat Jul 27 '16 at 18:59
  • I did not downvote, but I guess you got the downvote because this is a link-only answer with no details that actually answer the question. – mzjn Jul 27 '16 at 19:28