Problems parsing a web page in Python

Question

I would like to parse a web page to retrieve some information from it (specifically, all the items in this list: http://www.computerhope.com/vdef.htm).

However, I can't figure out how to do it.

A lot of tutorials on the internet start with this (simplified): html5lib.parse(urlopen("http://www.computerhope.com/vdef.htm"))

But after that, none of the tutorials explain how to browse the document and get to the HTML part I am looking for.

Some other tutorials explain how to do it with CSSSelector, but again, none of them start with a web page; they start with a string instead (e.g. here: http://lxml.de/cssselect.html).

So I tried to create a tree from the web page using this: fromstring(urlopen("http://www.computerhope.com/vdef.htm").read()) but I got this error: lxml.etree.XMLSyntaxError: Specification mandate value for attribute itemscope, line 3, column 28. The error occurs because the page contains an attribute with no value (e.g. <input attribute></input>), and since I don't control the web page, I can't work around it.

So here are two questions that would solve my problem:

How can I browse a tree?
Is there a way to make the parser less strict?

Thank you!

Look into XPath. It is a very powerful tool for querying any XML-like structure. – Łukasz Szcześniak, Jul 27 '16 at 19:03
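As a rough illustration of the XPath suggestion, here is a minimal sketch that assumes lxml is installed and swaps the strict XML parser for lxml.html, whose HTML parser accepts valueless attributes such as itemscope. The SAMPLE markup, its link targets, and the link_texts helper are made up for the example; they are not taken from the real page.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page. Note the valueless
# `itemscope` attribute, which the strict XML parser rejects.
SAMPLE = """<!DOCTYPE html>
<html itemscope>
<body>
<table>
<tr><td><a href="/jargon/v/v.htm">V</a></td></tr>
<tr><td><a href="/jargon/v/vga.htm">VGA</a></td></tr>
</table>
</body>
</html>"""


def link_texts(markup):
    """Return the text of every <a> element found inside a <table>."""
    tree = html.fromstring(markup)  # lenient HTML parser, not the XML one
    return [text.strip() for text in tree.xpath("//table//a/text()")]


items = link_texts(SAMPLE)
print(items)
```

For the live page, the same helper could presumably be fed the bytes returned by urlopen(...).read(); the XPath expression would then need adjusting to the page's actual layout.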

score 2 · Answer 1 · edited Oct 20 '22 at 01:12

Try using Beautiful Soup; it has some excellent features and makes parsing HTML in Python much easier.

Check out the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

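from bs4 import BeautifulSoup
import requests

# Download the page; requests exposes the decoded body as page.text
page = requests.get('http://www.computerhope.com/vdef.htm')
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(page.text, 'html.parser')
# This assumes the list of terms is in the first table on the page
tables = soup.find_all('table')
for link in tables[0].find_all('a'):
    print(link.text)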

It prints all the items in the list. I hope the OP will adjust it as needed.
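If installing third-party packages is not an option, roughly the same idea can be sketched with the standard library's html.parser, which is also tolerant of valueless attributes. This is only an illustration: the LinkTextCollector class and the SAMPLE markup are hypothetical, and unlike the snippet above it collects links from every table rather than just the first one.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text of every <a> element nested inside a <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # how many <table> elements are currently open
        self.in_link = False   # True while inside an <a> within a table
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            self.in_link = True
            self.items.append("")  # start a new entry for this link

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.items[-1] += data


# Hypothetical markup standing in for the real page.
SAMPLE = (
    '<html itemscope><body><a href="/">Home</a>'
    '<table><tr><td><a href="/v.htm">V</a></td>'
    '<td><a href="/vga.htm">VGA</a></td></tr></table>'
    '</body></html>'
)

collector = LinkTextCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.items)
```

The event-based style is more verbose than Beautiful Soup's tree navigation, which is a fair argument for the library when it is available.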

answered Jul 27 '16 at 17:50 by Bharat
edited Oct 20 '22 at 01:12 by miken32

May I ask why the downvote? If you do not like the library, that does not mean my answer is wrong; it just means our opinions differ. – Bharat Jul 27 '16 at 18:59

I did not downvote, but I guess you got the downvote because this is a link-only answer with no details that actually answer the question. – mzjn Jul 27 '16 at 19:28
