Parsing XML and HTML in the same project

Question

I want to parse both XML and HTML files in the same project.

I tried this:

from xml.etree import ElementTree as ET

tree = ET.parse(fpath)
html_file = ET.parse(htmlpath)

and got this error:

Traceback (most recent call last):
  File "C:.py", line 55, in <module>
    html_file = ET.parse("htmlpath")
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &nbsp;: line 690, column 78

The document referenced by `htmlpath` is not well-formed, and therefore it cannot be parsed as XML (ElementTree works with XML, not arbitrary HTML). The problem is that the document contains the `&nbsp;` entity reference without the corresponding declaration for the entity. See https://stackoverflow.com/q/14744945/407651. — mzjn, Aug 15 '19 at 09:56
I suggest that you try the BeautifulSoup library: https://pypi.org/project/beautifulsoup4/. You can use it for both XML and HTML. — mzjn, Aug 15 '19 at 13:47

score 0 · Answer 1 · answered May 15 '23 at 22:00

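`&nbsp;` is a standard HTML5 named entity, but XML predefines only five entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`), so the XML parser rejects it. It may help to convert such references to their Unicode characters before running the XML parser. In Python 3.4+ you can use `html.unescape` for that.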

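import re
from html import escape, unescape
# text holds the HTML source; unescape() decodes each entity, escape() re-escapes &, <, > for XML
textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)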
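
For illustration, here is a minimal end-to-end sketch of that idea. The `SAMPLE` string and the helper names `replace_named_entities` and `parse_html_as_xml` are made up for this example and are not part of the question. It also assumes the HTML is otherwise well-formed XML (all tags closed, attributes quoted). If it is not, this substitution alone will not be enough.

```python
import re
from html import escape, unescape
from xml.etree import ElementTree as ET

# Hypothetical sample standing in for the contents of the HTML file.
SAMPLE = "<html><body><p>Tom&nbsp;&amp;&nbsp;Jerry</p></body></html>"

# Matches named entity references such as &nbsp; or &amp;.
# Numeric references like &#160; are not matched; XML accepts those as-is.
ENTITY_RE = re.compile(r"&\w+;")


def replace_named_entities(text):
    """Replace named entity references with XML-safe text."""
    # unescape() turns "&nbsp;" into "\xa0"; escape() then re-escapes the
    # characters XML reserves, so "&amp;" round-trips back to "&amp;".
    return ENTITY_RE.sub(lambda m: escape(unescape(m.group(0))), text)


def parse_html_as_xml(text):
    """Parse XML-compatible HTML after replacing its named entities."""
    return ET.fromstring(replace_named_entities(text))


root = parse_html_as_xml(SAMPLE)
print(repr(root.find("./body/p").text))
```

Passing `SAMPLE` directly to `ET.fromstring` raises the same `ParseError` as in the question, because `&nbsp;` is not declared. After the substitution the paragraph text contains a real non-breaking space character instead.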
