You've found a bug in whichever parser you're using.
I don't know which parser you're using, but I do know this:
Python 2.7.2 (from Apple), BS 4.1.3 (from pip), libxml2 2.9.0 (from Homebrew), and lxml 3.1.0 (from pip) together get the exact same error as you. Everything else I try, including the same setup with libxml2 2.7.8 (from Apple) instead, works. And lxml is the default parser (at least as of 4.1.3) that BS will try first if you don't specify anything else.
I've also seen other unexpected bugs with libxml2 2.9.0 (most of which have been fixed on trunk, but no 2.9.1 has been released yet).
So, if this is your problem, you may want to downgrade libxml2 to 2.8.0 and/or build it from top of tree.
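If you're not sure whether you're on the affected combination, here's a quick sketch for checking which libxml2 your lxml is actually using. The helper name `libxml2_versions` is just something I made up for illustration. `LIBXML_VERSION` and `LIBXML_COMPILED_VERSION` are the version tuples that `lxml.etree` exposes.

```python
def libxml2_versions():
    """Return (runtime, compiled) libxml2 version tuples, or None if lxml is missing.

    Hypothetical helper for illustration; not part of BS or lxml.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    # LIBXML_VERSION is the library loaded at runtime;
    # LIBXML_COMPILED_VERSION is the one lxml was built against.
    return etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION


versions = libxml2_versions()
if versions is None:
    print("lxml is not installed, so BS can't be picking it")
else:
    runtime, compiled = versions
    print("libxml2 runtime: %s, compiled against: %s" % (runtime, compiled))
    if runtime == (2, 9, 0):
        print("This is the libxml2 version I could reproduce the error with")
```

If the runtime version comes back as `(2, 9, 0)`, that matches the setup where I saw the failure. Any other result only tells you the cause is probably somewhere else.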
But if not… it definitely works for me with Python 2.7.2 and the stdlib html.parser, and in chat you tested the same thing with 2.7.1. While html.parser (especially before 2.7.3) is slow and brittle, it seems to be good enough for you. So, the simplest solution is to do this:
soup = BeautifulSoup(content, 'html.parser')
… instead of just letting it pick its favorite parser.
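To make that concrete, here's a minimal self-contained sketch. The `make_soup` wrapper and the sample `content` string are my own stand-ins, since I don't have your real document. The only part that matters is passing the parser name as the second argument.

```python
from bs4 import BeautifulSoup


def make_soup(content, parser='html.parser'):
    """Parse content with an explicitly named parser.

    Hypothetical wrapper for illustration; the point is that BS never
    gets to choose lxml on its own.
    """
    return BeautifulSoup(content, parser)


# Stand-in document; substitute whatever you were parsing.
content = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>"

soup = make_soup(content)
print(soup.title.string)
print(soup.p.get_text())
```

Once you have a fixed libxml2, you can pass `'lxml'` as the `parser` argument to go back to the faster parser.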
For more info, see Specifying the parser to use in the BS docs (and the sections right above and below).