Use Case:

Parsing https://www.banca-romaneasca.ro/en/tools-and-resources/ with lxml fails:

...
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment
    self.tree.insertComment(token, self.tree.openElements[-1])
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree_lxml.py:312: in insertCommentMain
    super(TreeBuilder, self).insertComment(data, parent)
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/_base.py:262: in insertComment
    parent.appendChild(self.commentClass(token["data"]))
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: in __init__
    self._element = ElementTree.Comment(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

- src/lxml/lxml.etree.pyx:3017: ValueError: Comment may not contain '--' or end with '-'

The error is raised by lxml: https://github.com/lxml/lxml/blob/master/src/lxml/lxml.etree.pyx#L3017

It is triggered by a malformed comment in https://www.banca-romaneasca.ro/en/tools-and-resources/:

...
<script type="text/javascript" src="/_res/js/forms.js"></script>

<!-- Google Code for Remarketing Tag -->
<!--------------------------------------------------
Remarketing tags may not be associated with personally identifiable information or placed on pages related to sensitive categories. See more information and instructions on how to setup the tag on: http://google.com/ads/remarketingsetup
--------------------------------------------------->
<script type="text/javascript">
/* <![CDATA[ */
var google_conversion_id = 958631629;
var google_custom_params = window.google_tag_params;
... 

I am looking for a solution such as:

  • disabling the check below (some magic flag in lxml)

    if b'--' in text or text.endswith(b'-'):
        raise ValueError("Comment may not contain '--' or end with '-'")
    
  • monkey patching (changing the code, injection, ...)
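
A third option, not mentioned above, is to clean the comments before the document reaches the parser. The sketch below is only an illustration under stated assumptions: `sanitize_comments` is a hypothetical helper, not part of lxml or html5lib, it works on text (`str`) rather than bytes, and its regular expression can misfire on `<!--` sequences inside scripts or on degenerate comments such as `<!-->`.

```python
import re

# Matches an HTML comment: "<!--", a non-greedy body, then "-->".
_COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def sanitize_comments(html):
    """Rewrite comment bodies so they contain no '--' and do not end with '-'.

    Hypothetical helper for illustration; it is not an lxml or html5lib API.
    """

    def _fix(match):
        body = match.group(1)
        # Collapse every run of two or more hyphens into a single hyphen.
        body = re.sub(r"-{2,}", "-", body)
        # A body ending in '-' is rejected as well, so pad it with a space.
        if body.endswith("-"):
            body += " "
        return "<!--" + body + "-->"

    return _COMMENT_RE.sub(_fix, html)
```

The cleaned text could then be passed to the parser in place of the original document. The comment text changes slightly, which is usually harmless for scraping.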

Update 1:

I am using html5lib because I want to get the tags available in HTML5, such as sound, section, video...

import html5lib
from lxml.html import html5parser, fromstring

context = fromstring(document.content)  # works
context = html5parser.fromstring(document.content)  # fails

context = html5lib.parse(  # fails
    document.content,
    treebuilder="lxml",
    namespaceHTMLElements=document.namespace,
    encoding=document.encoding
)
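
As a stopgap, the failing and working calls can be chained so that the strict parser is tried first and the lenient one is used only when it raises. This is a minimal sketch, assuming the failure surfaces as the `ValueError` shown in the traceback; `parse_with_fallback` is a hypothetical helper name, and the fallback result is not produced by the HTML5 parsing algorithm.

```python
def parse_with_fallback(content, parsers):
    """Try each parser callable in order and return the first result.

    Hypothetical helper for illustration. Only ValueError is caught,
    because that is the exception raised for the bad comment.
    """
    last_error = None
    for parse in parsers:
        try:
            return parse(content)
        except ValueError as error:
            # Remember the failure and move on to the next parser.
            last_error = error
    if last_error is None:
        raise ValueError("no parsers were supplied")
    raise last_error


# Possible usage with the names from the question:
# context = parse_with_fallback(
#     document.content, [html5parser.fromstring, fromstring]
# )
```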

Versions:

  • html5lib==0.9999999
  • lxml==3.5.0 (downgrading lxml does not help either)

Update 2:

This appears to be an open improvement/issue in lxml: https://github.com/lxml/lxml/pull/172#issuecomment-169084439.

Waiting for feedback from the lxml developers.

Update 3:

I got feedback: it seems to be html5lib's fault, and the latest development version on GitHub already contains the fix.

Andrei.Danciuc

2 Answers

A solution has been found, based on a comment by @opottone on GitHub:

I tried installing latest html5parser from github. Now I only get a warning, not an error.

Padraic Cunningham
Andrei.Danciuc
  • Should probably push a release given the severity of it, actually… *wonders why he didn't do that before* – gsnedders Jan 12 '16 at 16:52
  • @gsnedders whom do you mean by "he"? – Andrei.Danciuc Jan 15 '16 at 09:21
  • Faced this issue in [Calibre](https://github.com/kovidgoyal/calibre) (it was not able to parse Amazon's book metadata due to double hyphens in comments). Upgrading `html5lib` version fixed the issue: `sudo pip2 install html5lib --upgrade` – madhead Sep 20 '16 at 19:29
1

Since this is HTML data that you are trying to parse, use lxml.html rather than lxml.etree.

This worked for me:

>>> import requests
>>> import lxml.html
>>> 
>>> data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/").content
>>> tree = lxml.html.fromstring(data)
>>> tree.xpath("//title/text()")
['Tools and resources - Banca Romaneasca']
alecxe
  • I updated the question with more details. @alecxe, will I also be able to get HTML5 tags like sound, video, section? – Andrei.Danciuc Jan 04 '16 at 16:37
  • @Andrei.Danciuc I don't see why not, but give it a try. Thanks! – alecxe Jan 04 '16 at 16:39
  • No :( it is not HTML5 compatible, see http://lxml.de/html5parser.html (html5lib is a Python package that implements the HTML5 parsing algorithm which is heavily influenced by current browsers and based on the WHATWG HTML5 specification.). – Andrei.Danciuc Jan 04 '16 at 17:28