Not able to parse big HTML using PyQuery

Question

As I'm not sure whether the issue I'm facing is a bug or a lack of knowledge on my side, I would like to ask for your assistance.

When I try to parse this URL (http://ies.ieee-ies.org/resources/media/publications/TIEpub/1988_2013.htm) using PyQuery, it apparently loads only the title, and the body is ignored:

>>> import urllib2
>>> from pyquery import PyQuery as pq

>>> response = urllib2.urlopen('http://ies.ieee-ies.org/resources/media/publications/TIEpub/1988_2013.htm').read() # 9MB page
>>> len(response)
9835026
>>> dom = pq(response)
>>> dom.html()
u'<head><title>IEEE Transactions on Industrial Electronics</title></head><body><h1 align="center">&#13;\n   <img border="0" src="ieeelogo.gif"/><font color="#FF6600">\xa0IEEE Tr
ansactions on Industrial Electronics\xa0&#13;\n   <img border="0" src="ieslogo.gif"/></font>&#13;\n   </h1><h2 align="center">&#13;\n   Volume 35, \xa0Number 1, Feb 1988 \xa0\xa
0\xa0\xa0\xa0\xa0\xa0\xa0\xa0&#13;\n   <a href="http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=41"><font size="4">Access to the journal on IEEE XPLORE</font></a><font s
ize="4"> </font>\xa0\xa0\xa0&#13;\n   <a href="http://tie.ieee-ies.org/"><font size="3">IE Transactions Home Page</font></a><font size="4"> </font> &#13;\n   </h2><hr/><br/><br/
></body>'

Is there a size limit for HTML parsing in PyQuery that I'm not aware of?

PS: I have a workaround using different pages that lead to the same content, but I would like to know the reason for this.

kindall · Accepted Answer · 2013-05-13T16:04:19.917

2

I'm pretty sure that the problem is not the size, but that the HTML of this page is very broken. It has more than 2000 <html> tags in it, for instance (the correct number is one). I'm shocked that a browser can make any sense of it whatsoever, but the Mozilla devs have a lot of experience with that kind of thing, and I imagine that the PyQuery devs, though undoubtedly diligent, probably have much less. If you can get the content from a different page, then by all means do that, especially if that page is less broken.
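A quick way to check this for yourself is to count the opening `<html>` tags in the raw bytes before handing them to any parser. The snippet below is only a rough sketch: `count_html_open_tags` and `sample` are names made up for illustration, and `sample` is a tiny stand-in for the real 9 MB page, not its actual content.

```python
import re

# Matches an opening <html> tag, with or without attributes, in any letter case.
HTML_OPEN = re.compile(rb"<html\b[^>]*>", re.IGNORECASE)


def count_html_open_tags(raw: bytes) -> int:
    """Return how many opening <html> tags appear in the raw page bytes."""
    return len(HTML_OPEN.findall(raw))


# Hypothetical stand-in for the page: two complete documents glued together.
sample = (
    b"<html><head><title>T</title></head><body><h1>Volume 35</h1></body></html>"
    b"<html><body><p>Volume 36</p></body></html>"
)

print(count_html_open_tags(sample))
```

A well-formed page should give a count of 1. Running the same function on the `response` bytes from the question should show whether the page really contains thousands of them.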

edited May 13 '13 at 16:04

answered May 13 '13 at 15:56


3

PyQuery uses `lxml`, which does a decent job, but as with all broken HTML, your mileage may vary: garbage in, garbage out. `html5lib` would handle it differently, usually closer to what browsers do. – Martijn Pieters May 13 '13 at 16:00
OMFG it **actually** does have several thousand `<html>` tags. So much for the technical competency of one of the greatest engineering associations on this planet. – eMPee584 Jul 07 '13 at 12:01
