Say I have an expat parser instantiated like so:

import xml.parsers.expat

def on_character_data(data):
    # expat passes character data here after expanding entity references
    print(data)

parser = xml.parsers.expat.ParserCreate(encoding=encoding)
...
parser.CharacterDataHandler = on_character_data
...

And an XML document like so:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  </head>
<body>
  ampersands &amp; other annoyances
</body>
</html>

If I call `parser.Parse(test_xml_string)`, the handler `on_character_data()` receives the text `ampersands &amp; other annoyances` as `ampersands & other annoyances`, with `&amp;` replaced by `&`. I want expat to leave these entities alone, so that `on_character_data()` receives the unmodified `ampersands &amp; other annoyances`. Is there any way to do this?
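For reference, here is a minimal self-contained sketch that reproduces the behavior. The document is a trimmed copy of the one above, the encoding is assumed to be UTF-8, and the `chunks` list and `received` variable are just names picked for this example. expat may deliver the text in several pieces, so the handler collects them and they are joined afterwards.

```python
import xml.parsers.expat

# Trimmed copy of the document from the question (assumed UTF-8).
test_xml_string = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<body>\n'
    '  ampersands &amp; other annoyances\n'
    '</body>\n'
    '</html>\n'
)

chunks = []  # hypothetical accumulator for this example

def on_character_data(data):
    # expat may call this several times for one run of text,
    # so collect the pieces instead of printing them one by one.
    chunks.append(data)

parser = xml.parsers.expat.ParserCreate(encoding="UTF-8")
parser.CharacterDataHandler = on_character_data
parser.Parse(test_xml_string, True)  # True marks this as the final chunk

received = "".join(chunks)
print(repr(received))
```

The joined text contains `ampersands & other annoyances`; the original `&amp;` never reaches the handler.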

midrare
  • By "these entities", do you mean any entities? XML has just five predefined entities. – mzjn Oct 16 '21 at 14:14
  • @mzjn Yes, any entity. I don't want any entity unescaping at all. – midrare Oct 17 '21 at 01:30
  • What about numeric character references, such as `&#65;`? – mzjn Oct 17 '21 at 09:47
  • @mzjn Yes, including those. I want everything in the `&xxx;` form in their original, unescaped forms. – midrare Oct 18 '21 at 01:14
  • This does not seem to be trivial to accomplish. Why do you need this? – mzjn Oct 18 '21 at 05:54
  • @mzjn I'm trying to parse HTML with output source mapping info. The only XML parsing library I've found that can do this is `expat`, through `Parser.CurrentByteIndex`. The idea is to build the tree while saving the value of `Parser.CurrentByteIndex` so I end up with a map of each element's start and end byte offsets within the original HTML bytestring. – midrare Oct 19 '21 at 23:16
  • @mzjn In the character data handler, `Parser.CurrentByteIndex` points to the byte offset of the *end* of the character data in the original markup bytestring. The *start* byte offset *should be* `Parser.CurrentByteIndex - len(char_data_str.encode('utf-8'))`. But if there was one or more entity expansions, `Parser.CurrentByteIndex - len(char_data_str.encode('utf-8'))` won't match the start byte offset anymore. – midrare Oct 19 '21 at 23:17
  • OK, I see. The explanation in your last two comments should be in the question itself. – mzjn Oct 20 '21 at 04:10
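To make the offset problem from the comments concrete, here is a small sketch of the arithmetic only. It does not call expat. The two strings are the ones from the question, and the variable names are made up for illustration.

```python
# Text as it appears in the original markup.
raw_markup = "ampersands &amp; other annoyances"
# Text as the character data handler receives it after entity expansion.
char_data = "ampersands & other annoyances"

raw_len = len(raw_markup.encode("utf-8"))
seen_len = len(char_data.encode("utf-8"))

# Subtracting seen_len from the end offset lands this many bytes
# too late, because "&amp;" (5 bytes) was collapsed to "&" (1 byte).
drift = raw_len - seen_len
print(raw_len, seen_len, drift)
```

Each expanded `&amp;` shifts the computed start offset by 4 bytes, and other entities or numeric character references shift it by different amounts, so the error cannot be corrected with a fixed adjustment.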
