14

I'm trying to parse some HTML in Python. There were some methods that actually worked before, but nowadays there's nothing I can actually use without workarounds.

  • beautifulsoup has problems after SGMLParser went away
  • html5lib cannot parse half of what's "out there"
  • lxml is trying to be "too correct" for typical html (attributes and tags cannot contain unknown namespaces, or an exception is thrown, which means almost no page with Facebook connect can be parsed; see the snippet just below)
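
For concreteness, a made-up snippet with an undeclared fb: prefix (roughly what Facebook Connect markup looks like) is enough to make lxml's XML parser bail out:

>>> from lxml import etree
>>> etree.fromstring('<div><fb:login-button></fb:login-button></div>')
Traceback (most recent call last):
  ...
lxml.etree.XMLSyntaxError: Namespace prefix fb on login-button is not defined, ...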

What other options are there these days? (if they support xpath, that would be great)

viraptor
  • You need to give us examples of pages that your current approaches fail on. Otherwise, how will we know if our proposed solutions will solve your problems? Also, don't forget to report the html5lib bugs at http://code.google.com/p/html5lib/issues/entry – Gareth Rees Nov 06 '10 at 19:40

5 Answers

19

Make sure that you use the html module when you parse HTML with lxml:

>>> from lxml import html
>>> doc = """<html>
... <head>
...   <title> Meh
... </head>
... <body>
... Look at this interesting use of <p>
... rather than using <br /> tags as line breaks <p>
... </body>"""
>>> html.document_fromstring(doc)
<Element html at ...>

All the errors and exceptions will melt away, and you'll be left with an amazingly fast parser that often deals with HTML soup better than BeautifulSoup.
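
Since the question mentions XPath: the tree you get back from lxml.html supports it directly. A minimal sketch (the markup and the expressions here are made up purely for illustration):

>>> from lxml import html
>>> doc = html.document_fromstring(
...     '<p>Some text with a <a href="http://example.com">link</a></p>')
>>> doc.xpath('//a/@href')
['http://example.com']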

Tim McNamara
  • That's interesting. I always used lxml via a treebuilder. I was pretty sure that HTMLParser used this way forced html mode. Apparently not. lxml.html parses the stuff like `` correctly. (where lxml threw an exception) – viraptor Nov 06 '10 at 20:24
  • Am glad to hear that things are working well for you! Thanks for accepting the answer. – Tim McNamara Nov 06 '10 at 20:25
7

I've used pyparsing for a number of HTML page scraping projects. It is a sort of middle ground between BeautifulSoup and the full HTML parsers on one end, and the too-low-level approach of regular expressions on the other (that way lies madness).

With pyparsing, you can often get good HTML scraping results by identifying the specific subset of the page or data that you are trying to extract. This approach avoids the issues of trying to parse everything on the page, since some problematic HTML outside of your region of interest could throw off a comprehensive HTML parser.

While this sounds like just a glorified regex approach, pyparsing offers builtins for working with HTML- or XML-tagged text. Pyparsing avoids many of the pitfalls that frustrate the regex-based solutions:

  • accepts whitespace without littering '\s*' all over your expression
  • handles unexpected attributes within tags
  • handles attributes in any order
  • handles upper/lower case in tags
  • handles attribute names with namespaces
  • handles attribute values in double quotes, single quotes, or no quotes
  • handles empty tags (those of the form <blah />)
  • returns parsed tag data with object-attribute access to tag attributes

Here's a simple example from the pyparsing wiki that gets <a href=xxx> tags from a web page:

import urllib

from pyparsing import makeHTMLTags, SkipTo

# read HTML from a web page
page = urllib.urlopen("http://www.yahoo.com")
htmlText = page.read()
page.close()

# define pyparsing expression to search for within HTML    
anchorStart,anchorEnd = makeHTMLTags("a")
anchor = anchorStart + SkipTo(anchorEnd).setResultsName("body") + anchorEnd

for tokens,start,end in anchor.scanString(htmlText):
    print tokens.body,'->',tokens.href

This will pull out the <a> tags, even if other portions of the page contain problematic HTML. There are more HTML examples at the pyparsing wiki.

Pyparsing is not a totally foolproof solution to this problem, but because it exposes the parsing process to you, you can better control which pieces of the HTML you are specifically interested in, process them, and skip the rest.

PaulMcG
3

If you are scraping content, an excellent way to get around irritating details is the sitescraper package. It uses machine learning to determine which content to retrieve for you.

From the homepage:

>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", 
             ["Learning Python, 3rd Edition", 
             "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)", 
             "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I   generally use 3)
>>> # ss.add(url2, data2) 
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-  keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell    Programming", 
"Linux Pocket Guide", 
"Linux in a Nutshell (In a Nutshell (O'Reilly))", 
'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 
'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]
Tim McNamara
3

html5lib cannot parse half of what's "out there"

That sounds extremely implausible. html5lib uses exactly the same algorithm that's also implemented in recent versions of Firefox, Safari and Chrome. If that algorithm broke half the web, I think we would have heard. If you have particular problems with it, do file bugs.
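
For what it's worth, a minimal sketch of feeding messy markup to html5lib; the markup here is invented, and the `treebuilder="lxml"` option (which requires lxml to be installed) gives back a tree that supports XPath, as asked for in the question. Note that html5lib puts elements in the XHTML namespace by default, hence the prefix in the expression:

>>> import html5lib
>>> doc = html5lib.parse("<p>messy <b>markup", treebuilder="lxml")
>>> doc.xpath("//h:b/text()", namespaces={"h": "http://www.w3.org/1999/xhtml"})
['markup']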

Ms2ger
  • Ok - maybe not half, but it broke on some script tags (don't remember the site), misses a big chunk of youtube (sometimes) and other sites I tried to use it with. I'll report stuff that I can reproduce. – viraptor Nov 06 '10 at 21:13
  • Script tags are a horrible mess, indeed, but their handling changed quite a bit recently. I hope you'll find it works better now. – Ms2ger Nov 10 '10 at 20:37
2

I think the problem is that most HTML is ill-formed. XHTML tried to fix that, but it never really caught on enough - especially as most browsers do "intelligent workarounds" for ill-formed code.

Even a few years ago I tried to parse HTML for a primitive spider-type app, and found the problems too difficult. I suspect writing your own might be on the cards, although we can't be the only people with this problem!

winwaed