
I want to parse both XML and HTML files in the same project.

I tried this:

from xml.etree import ElementTree as ET

tree = ET.parse(fpath)
html_file = ET.parse(htmlpath)

and got this error:

Traceback (most recent call last):
  File "C:.py", line 55, in <module>
    html_file = ET.parse("htmlpath")
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &nbsp;: line 690, column 78

mzjn
Jonas Liddell
  • The document referenced by `htmlpath` is not well-formed XML, so it cannot be parsed as XML (ElementTree works with XML, not arbitrary HTML). The document contains the `&nbsp;` entity reference without a corresponding declaration for that entity. See https://stackoverflow.com/q/14744945/407651. – mzjn Aug 15 '19 at 09:56
  • I suggest that you try the BeautifulSoup library: https://pypi.org/project/beautifulsoup4/. You can use it for both XML and HTML. – mzjn Aug 15 '19 at 13:47
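
To illustrate the point made in the first comment, here is a minimal sketch that reproduces the error on a tiny string. The `snippet` value is a made-up stand-in for the real file; only standard-library calls are used. It shows that a named HTML entity such as `&nbsp;` is rejected, while an XML-predefined entity (`&amp;`) and a numeric character reference (`&#160;`) are accepted.

```python
from xml.etree import ElementTree as ET

# Hypothetical one-line stand-in for the HTML file from the question.
snippet = "<p>a&nbsp;b</p>"

try:
    ET.fromstring(snippet)
    error = None
except ET.ParseError as exc:
    # &nbsp; is an HTML entity; XML does not know it unless a DTD declares it.
    error = str(exc)

print(error)

# The five XML-predefined entities and numeric references parse without a DTD.
predefined = ET.fromstring("<p>a&amp;b</p>").text
numeric = ET.fromstring("<p>a&#160;b</p>").text
print(repr(predefined), repr(numeric))
```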

1 Answer


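`&nbsp;` is a standard HTML5 named entity, but XML predefines only five entities (`&lt;`, `&gt;`, `&amp;`, `&quot;` and `&apos;`), so the XML parser rejects it. It may help to convert the named entities to their Unicode characters before running the XML parser. In Python 3.4+ you can use `html.unescape` for that, re-escaping the result so that characters such as `<` and `&` stay valid XML: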

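import re
from html import escape, unescape
# text holds the file contents; each named entity becomes its character, and XML-special characters are re-escaped
textXML = re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)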
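
As a rough end-to-end sketch of this approach, the substitution can be wrapped in a helper and its result fed to `ET.fromstring`. The helper name `html_entities_to_xml` and the `sample` string are assumptions made for illustration; in the question, the text would be read from the file at `htmlpath` instead.

```python
import re
from html import escape, unescape
from xml.etree import ElementTree as ET


def html_entities_to_xml(text):
    """Replace named entities with their characters, keeping XML-special ones escaped."""
    # unescape() turns &nbsp; into U+00A0 and &lt; into "<";
    # escape() then turns "<", ">", "&" and quotes back into XML-safe references.
    return re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)


# Hypothetical sample standing in for the contents of the HTML file.
sample = "<html><body><p>Fish&nbsp;&amp;&nbsp;chips &lt;fresh&gt;</p></body></html>"

root = ET.fromstring(html_entities_to_xml(sample))
paragraph_text = root.find("body/p").text
print(repr(paragraph_text))
```

This only deals with entity references. HTML that is not well-formed in other ways (unclosed tags, unquoted attributes) will still fail in ElementTree, in which case an HTML parser such as the BeautifulSoup library suggested in the comments is likely the better fit.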
Guido U. Draheim