Best way to convert HTML to plaintext using Python

Question

I'm working on a project that involves converting a large amount of HTML content to plain text. I have a custom-written module that does the job OK, but I'm wondering if there are any standard tools to help get the job done.

score 10 · Accepted Answer · answered Nov 03 '09 at 15:37

Html2Text seems to be a good option.
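
For anyone who wants to try it, here is a minimal sketch of how this might look, assuming the `html2text` package from PyPI is installed (`pip install html2text`). The function name `html_to_markdownish_text` and the two option settings are illustrative choices, not something from the answer above. Note that html2text emits Markdown-flavoured text (for example `**bold**`), not bare text.

```python
# Assumption: the third-party html2text package is installed.
try:
    import html2text
except ImportError:  # keep the sketch importable without the package
    html2text = None


def html_to_markdownish_text(html: str) -> str:
    """Convert an HTML string to Markdown-flavoured plain text (hypothetical helper)."""
    if html2text is None:
        raise RuntimeError("html2text is not installed")
    converter = html2text.HTML2Text()
    converter.ignore_links = True  # keep the link text, drop the URL
    converter.body_width = 0       # do not hard-wrap long lines
    return converter.handle(html)
```

Calling `html_to_markdownish_text("<p>Hello <b>world</b></p>")` should give back the paragraph text with Markdown emphasis markers around "world".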

Chris Ballance

The site is no longer accessible, since Aaron, the author, is no longer with us. – black_puppydog Apr 10 '13 at 15:02

tcarobruce · Answer 2 · 2012-11-30T20:57:49.743

4

Here's a Python library that does HTML parsing:

lxml.html

BeautifulSoup is another option.
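
A rough sketch of how either parser could be used to pull text out of markup, assuming `lxml` and `beautifulsoup4` are installed from PyPI (the `bs4` import name is the current packaging of Beautiful Soup, which may differ from what was available when this was answered). The helper names `lxml_text` and `soup_text` are made up for illustration.

```python
# Assumption: lxml and beautifulsoup4 are installed; both imports are guarded
# so the sketch still loads when one of them is missing.
try:
    from lxml import html as lxml_html
except ImportError:
    lxml_html = None

try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None


def lxml_text(html: str) -> str:
    """Return all text nodes concatenated; script/style content is NOT removed."""
    if lxml_html is None:
        raise RuntimeError("lxml is not installed")
    return str(lxml_html.fromstring(html).text_content())


def soup_text(html: str) -> str:
    """Return visible text, one non-empty line per text fragment."""
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is not installed")
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-visible content before extracting text
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)
```

Both helpers only strip tags; unlike a dedicated converter they make no attempt to preserve structure such as lists, headings, or link targets, so which one fits depends on how much layout the plain text needs to keep.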

edited Nov 30 '12 at 20:57

answered Nov 03 '09 at 15:39

To save others some time circling from Google back to SO, here is a Q&A explaining that Beautiful Soup is not really maintained anymore: [WebScraping with BeautifulSoup or LXML.HTML](http://stackoverflow.com/questions/5493514/webscraping-with-beautifulsoup-or-lxml-html). – sage Jul 14 '11 at 14:59

Beautiful Soup appears to be maintained now, I think. – contrebis Nov 29 '12 at 15:49
