
I want to get "Main content" instead of <tag>Main content</tag>, where the latter is the HTML code that can be retrieved using urllib.urlopen(url).

In other words, I want the same text you would get by opening the URL in a browser, selecting all the text, and copying and pasting it.

Is there a way to do this with Python?

Thanks.

ibread
  • Duplicate? http://stackoverflow.com/questions/3172343/extracting-readable-text-from-html-using-python – msanders Jul 15 '10 at 10:03

1 Answer


Have a look at Beautiful Soup.

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
  3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
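
As a rough sketch of how the "select all, copy and paste" result might be approximated, the snippet below strips script and style elements and returns the remaining text. It assumes the current bs4 package (installed as beautifulsoup4) and falls back to the standard library's html.parser if bs4 is not available, so the two paths may differ slightly on malformed markup. The names visible_text, _TextCollector, and sample are illustrative only. The HTML is given inline here; in Python 3 it would typically come from urllib.request.urlopen(url).read() rather than the urllib.urlopen(url) call mentioned in the question.

```python
import re
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
except ImportError:  # keep the sketch runnable without it
    BeautifulSoup = None

# Elements whose contents a browser never displays as page text.
SKIP_TAGS = {"script", "style"}


class _TextCollector(HTMLParser):
    """Standard-library fallback that gathers text outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the human-readable text of an HTML string (hypothetical helper)."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(list(SKIP_TAGS)):
            tag.decompose()  # remove the element and everything inside it
        raw = soup.get_text(separator=" ")
    else:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        raw = " ".join(collector.chunks)
    # Collapse the runs of whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", raw).strip()


sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Main content</p><script>alert('x')</script></body></html>"
)
print(visible_text(sample))
```

Note that this keeps all text nodes, including the page title, and does not account for elements hidden with CSS, so it only approximates what a browser would show.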
Jon Cage