Use case:
Parsing https://www.banca-romaneasca.ro/en/tools-and-resources/ with lxml fails:
...
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment
self.tree.insertComment(token, self.tree.openElements[-1])
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree_lxml.py:312: in insertCommentMain
super(TreeBuilder, self).insertComment(data, parent)
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/_base.py:262: in insertComment
parent.appendChild(self.commentClass(token["data"]))
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: in __init__
self._element = ElementTree.Comment(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
- src/lxml/lxml.etree.pyx:3017: ValueError: Comment may not contain '--' or end with '-'
The error is raised by lxml: https://github.com/lxml/lxml/blob/master/src/lxml/lxml.etree.pyx#L3017
It is triggered by a comment made of long runs of dashes in https://www.banca-romaneasca.ro/en/tools-and-resources/
...
<script type="text/javascript" src="/_res/js/forms.js"></script>
<!-- Google Code for Remarketing Tag -->
<!--------------------------------------------------
Remarketing tags may not be associated with personally identifiable information or placed on pages related to sensitive categories. See more information and instructions on how to setup the tag on: http://google.com/ads/remarketingsetup
--------------------------------------------------->
<script type="text/javascript">
/* <![CDATA[ */
var google_conversion_id = 958631629;
var google_custom_params = window.google_tag_params;
...
I am looking for a solution such as:
- disabling this check in lxml (some magic, a flag, ...):
if b'--' in text or text.endswith(b'-'): raise ValueError("Comment may not contain '--' or end with '-'")
- monkey patching (changing the code, injection, ...)
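A third option, not listed above, is to clean the markup before it reaches the parser. The following is only a minimal sketch under stated assumptions: the function name sanitize_comments is hypothetical, the document is assumed to be available as a text string (not bytes), and the regular expression is a rough approximation that will also rewrite comment-like text inside script blocks. It collapses runs of dashes inside each comment and trims dashes at the edges, so the comment body no longer trips the check quoted above.

```python
import re

# Matches one HTML comment at a time; DOTALL lets the body span several lines.
_COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def sanitize_comments(html):
    """Return html with comment bodies free of '--' and of edge dashes.

    Hypothetical helper for illustration; not part of lxml or html5lib.
    """
    def _fix(match):
        body = match.group(1)
        # Collapse every run of two or more dashes into a single dash.
        body = re.sub(r"-{2,}", "-", body)
        # Drop dashes at both ends: a trailing '-' is rejected by the check,
        # and a leading one would produce an awkward '<!---' opener.
        body = body.strip("-")
        return "<!--" + body + "-->"

    return _COMMENT_RE.sub(_fix, html)


if __name__ == "__main__":
    bad = "<!------\nRemarketing tags\n------->"
    print(sanitize_comments(bad))
```

The cleaned string could then be handed to the same parsing call that currently fails. Whether this is acceptable depends on whether the exact comment text matters to you, since the dashes are altered.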
Update 1:
I am using html5lib because I want the tags available in HTML5 (sound, section, video...) to be parsed.
import html5lib
from lxml.html import html5parser, fromstring
context = fromstring(document.content)  # works
context = html5parser.fromstring(document.content)  # does not work
context = html5lib.parse(  # does not work
    document.content,
    treebuilder="lxml",
    namespaceHTMLElements=document.namespace,
    encoding=document.encoding
)
Versions:
- html5lib==0.9999999
- lxml==3.5.0 (downgrading lxml does not help either)
Update 2:
This appears to be an issue (or a possible improvement) in lxml: https://github.com/lxml/lxml/pull/172#issuecomment-169084439.
Waiting for feedback from the lxml developers.
Update 3:
I got feedback: it seems to be html5lib's fault, and the latest development version on GitHub already contains fixes.