How to prevent expat from automatically substituting entities?

Question

Say I have an expat parser instantiated like so:

import xml.parsers.expat

def on_character_data(data):
    print(data)

parser = xml.parsers.expat.ParserCreate(encoding=encoding)
...
parser.CharacterDataHandler = on_character_data
...

And an XML document like so:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  </head>
<body>
  ampersands &amp; other annoyances
</body>
</html>

If I call parser.Parse(test_xml_string), the handler on_character_data() receives the string `ampersands &amp; other annoyances` as `ampersands & other annoyances`, with the `&amp;` replaced by `&`. I want expat to leave these entities alone, so that on_character_data() receives the unmodified `ampersands &amp; other annoyances`. Is there any way I can do this?
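A minimal, self-contained reproduction of the behavior described above might look like the following. It is only a sketch: the document is trimmed to the relevant part, the handler collects its input in a list (the name `chunks` is introduced here for illustration) instead of printing it, and the parser is created without the elided `encoding` argument.

```python
import xml.parsers.expat

# Trimmed version of the document from the question, passed as bytes.
test_xml = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">\n'
    b'<body>\n'
    b'  ampersands &amp; other annoyances\n'
    b'</body>\n'
    b'</html>\n'
)

chunks = []  # illustrative: collects every string passed to the handler


def on_character_data(data):
    # expat may call this several times for one run of text.
    chunks.append(data)


parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = on_character_data
parser.Parse(test_xml, True)

# Join the pieces to see the text as the handler received it.
received = "".join(chunks)
print(repr(received))
```

The joined text contains `ampersands & other annoyances`: the `&amp;` reference has already been expanded by the time the handler sees it.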

By "these entities", do you mean any entities? XML has just five predefined entities. — mzjn, Oct 16 '21 at 14:14
@mzjn Yes, any entity. I don't want any entity unescaping at all. — midrare, Oct 17 '21 at 01:30
@mzjn Yes, including those. I want everything in the `&xxx;` form in their original, unescaped forms. — midrare, Oct 18 '21 at 01:14
This does not seem to be trivial to accomplish. Why do you need this? — mzjn, Oct 18 '21 at 05:54
@mzjn I'm trying to parse HTML with output source mapping info. The only XML parsing library I've found that can do this is `expat`, through `Parser.CurrentByteIndex`. The idea is to build the tree while saving the value of `Parser.CurrentByteIndex` so I end up with a map of each element's start and end byte offsets within the original HTML bytestring. — midrare, Oct 19 '21 at 23:16
@mzjn In the character data handler, `Parser.CurrentByteIndex` points to the byte offset of the *end* of the character data in the original markup bytestring. The *start* byte offset *should be* `Parser.CurrentByteIndex - len(char_data_str.encode('utf-8'))`. But if there was one or more entity expansions, `Parser.CurrentByteIndex - len(char_data_str.encode('utf-8'))` won't match the start byte offset anymore. — midrare, Oct 19 '21 at 23:17
OK, I see. The explanation in your last two comments should be in the question itself. — mzjn, Oct 20 '21 at 04:10
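The offset arithmetic from the comments above can be illustrated without expat at all. The sketch below assumes, as the comments describe, that the end offset of the character data is known, and it simulates that value by searching the raw bytes; all variable names are hypothetical. It shows only why subtracting the length of the expanded text gives the wrong start offset.

```python
# Raw markup as it appears in the original bytestring.
document = b"<body>ampersands &amp; other annoyances</body>"
raw = b"ampersands &amp; other annoyances"

# Text as the character data handler receives it, after entity expansion.
delivered = "ampersands & other annoyances"

# True start of the character data in the original bytes.
true_start = document.index(raw)

# Stand-in for the end offset described in the comments (assumption:
# simulated here rather than read from Parser.CurrentByteIndex).
end = true_start + len(raw)

# Start offset computed the way the comments describe.
naive_start = end - len(delivered.encode("utf-8"))

# Each "&amp;" is 5 bytes in the markup but 1 byte once expanded,
# so the naive start is off by 4 bytes per reference.
drift = naive_start - true_start
print(true_start, naive_start, drift)
```

With a single `&amp;` in the text, the computed start is 4 bytes too late, and the error grows with every further entity reference in the same run of character data.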

0 Answers