Parsing HTML in Python

Question

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses `sgmllib`, but it's a bit low-level and it's now deprecated.

I would prefer it if the parser could stomach a bit of malformed HTML, although I'm pretty sure most of the input will be pretty clean.

I was trying to avoid answers getting completely sidetracked. My reasons for avoiding BeautifulSoup are hugely debatable, but I was saving that for another day! (My reasons for avoiding lxml are simple: a complete failure to install it on either Mac OS X or Linux. :( ) – Andy Baker Apr 05 '09 at 09:27
Here is how to install lxml on Linux: `sudo apt-get install libxml2-dev libxslt-dev python2.7-dev` (`python2.6-dev` if you use Python 2.6). Then `sudo pip install lxml`. — Jabba, Aug 12 '11 at 20:32

Andrei Taranchenko · Accepted Answer · 2019-12-17T21:48:06.127

10

Python has a native HTML parser (`html.parser` in Python 3, `HTMLParser` in Python 2); however, the Tidy wrapper Nick suggested would probably be a solid choice as well. Tidy is a very common library (written in C, I think).
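To give an idea of what the native parser looks like in practice, here is a minimal sketch using `html.parser` from the Python 3 standard library. The names `LinkCollector` and `extract_links` are made up for this illustration, and the sample markup is an assumption about what "slightly malformed" input might look like (an unclosed `<p>` and a bare `<br>`).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) pairs from <a> tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Feed a string of HTML to the parser and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p>Unclosed paragraph <a href="/docs">Docs &amp; more</a><br><a href="/x">X</a>'
print(extract_links(sample))
# [('/docs', 'Docs & more'), ('/x', 'X')]
```

Note that this parser is event-based: it reports tags as it meets them and does not build a tree or repair the markup, so in this sketch an `<a>` that is never closed is simply not recorded.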

edited Dec 17 '19 at 21:48

answered Apr 04 '09 at 20:00

Andrei Taranchenko


Can someone please tell me why people suggest BeautifulSoup or lxml over the native HTML parser? – Shatu Aug 25 '17 at 20:33
Link is broken… I guess this is [html.parser](https://docs.python.org/3/library/html.parser.html)? Or the [version for legacy Python](https://docs.python.org/2/library/htmlparser.html). – Brutus Dec 17 '19 at 17:36
The module is still there, the URL appears to have changed though. Fixed. – Andrei Taranchenko Dec 17 '19 at 21:48

score 2 · Answer 2 · answered Jun 27 '12 at 17:37


You can install lxml and many other Python modules easily and seamlessly on the Mac (OS X) using Pallet, which is the official MacPorts GUI.

The module name is py27-lxml. Easy as 1, 2, 3.


Gussisaurio


score 2 · Answer 3 · answered Apr 04 '09 at 18:14


Perhaps µTidylib will meet your needs?


Nick Presta


score 1 · Answer 4 · edited Aug 25 '17 at 22:43


html5lib is good:
http://code.google.com/p/html5lib/

Update: The link above is broken. A third-party mirror of it can be accessed at https://github.com/html5lib/gcode-import

edited Aug 25 '17 at 22:43

Shatu


answered Jun 04 '10 at 11:51

rudyryk


While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – dgw Aug 22 '12 at 13:30
This isn't *quite* a link-only answer, @Dgw. It contains a complete sentence mentioning the name of the linked-to library, and in the case of this question, the name of a library *is* the essential part of the answer. Anyone can search for it in case the link is dead. – Rob Kennedy Oct 10 '12 at 15:08

score 1 · Answer 5 · answered Mar 23 '11 at 14:25


htql is good at handling malformed HTML:

http://htql.net/


seagulf


score 1 · Answer 6 · answered Apr 04 '09 at 18:29

http://www.xmlhack.com/read.php?item=1392

http://sourceforge.net/projects/pirxx/

http://pyxml.sourceforge.net/topics/

I don't have much experience with Python, but I have used Xerces (from the Apache Foundation) in the past and found it to be very useful. The learning curve isn't bad either, though I'm not coming from a Python perspective. I suggest you consider it. (The first two links discuss Python interfaces to Xerces, and the last one is the first Google hit for "python xml".)

I know you want an HTML parser, but these will be good starting places. — Joe Bane, Apr 04 '09 at 18:31
