
I am seeing the error "SAXParseException: output_pdml.xml:1089:0: not well-formed (invalid token)" while parsing a pcap log in PDML .xml format. The root cause is this string in the decoded log message: "". The characters "<><>" cause the failure. Going through the Python xml.sax documentation, I found an escape function that replaces such characters with their escaped form:

xml.sax.saxutils.escape(data, entities={}): Escape '&', '<', and '>' in a string of data. But I do not see how to invoke this function, since xml.sax is entirely event-driven. In my code most of the parsing happens in the startElement and endElement functions. I suspect the exception is raised internally while startElement is being invoked.

def startElement(self, data, attr): The startElement function just checks the type of the data element and creates a Python object structure; I assume the error happens during these steps. My question is how to invoke escape when creating the parser instance and invoking the content handler:

 parser = xml.sax.make_parser()
 parser.setContentHandler(content_handler_fun(call_back))
 parser.setFeature(xml.sax.handler.feature_external_ges, False)
 parser.parse(stdout_stream)

How do I invoke the escape function before calling parse()?

I tried calling escape() like this:
 parser = xml.sax.make_parser()
 parser.setContentHandler(content_handler_fun(call_back))
 parser.setFeature(xml.sax.handler.feature_external_ges, False)
 data = xml.sax.saxutils.escape(stdout_stream)
 parser.parse(data)

Failure observed: '_io.BufferedReader' object has no attribute 'replace'. I think escape() expects string data, but in my case I have to parse the content of a stdout stream. Reading the whole stream into a string before parsing is not possible: the content is too large to load into memory, so I need to keep passing a stream reference.
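A minimal sketch of the behaviour described above, using only the standard library: escape() and quoteattr() work on str values, while passing a stream object reproduces the AttributeError. The sample attribute value is taken from the answer below; the variable names are illustrative only.

```python
import io
from xml.sax.saxutils import escape, quoteattr

# escape() and quoteattr() operate on plain strings.
escaped_text = escape("nice <d83dde0a>;")
quoted_attr = quoteattr("nice <d83dde0a>;")
print(escaped_text)  # nice &lt;d83dde0a&gt;;
print(quoted_attr)   # "nice &lt;d83dde0a&gt;;"

# A stream has no .replace() method, so escape() fails on it.
stream = io.BufferedReader(io.BytesIO(b'<field show="nice <d83dde0a>;" />'))
try:
    escape(stream)
    error = None
except AttributeError as exc:
    error = str(exc)
print(error)
```

Note that escape() also rewrites '&', so applying it to text that already contains entities such as `&quot;` would escape them a second time.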

chandra
  • Please share a minimal, reproducible code example with a dummy XML file as input that shows your error message. – Hermann12 Jun 02 '23 at 22:28
  • @Hermann12, here is the Python code and the XML file content that fails (fail_xml):
        import xml.sax

        class XmlParser(xml.sax.handler.ContentHandler):
            def characters(self, chars):
                print("chars....", chars)

        if __name__ == '__main__':
            filename = "fail_xml.xml"
            fh = open(filename, "r")
            parser = xml.sax.make_parser()
            handler = XmlParser()
            parser.setContentHandler(handler)
            parser.setFeature(xml.sax.handler.feature_external_ges, False)
            parser.parse(fh)
    – chandra Jun 05 '23 at 11:08

1 Answer


Not an answer to your question, but maybe lxml could be a solution?

from lxml import etree

fail_xml = """
<pdml version="0">
  <field name="test" show="nice <d83dde0a>;" />
</pdml>"""

parser = etree.XMLParser(recover=True) # recover from bad characters.
root = etree.fromstring(fail_xml, parser=parser)
for elem in root.iter():
    print(elem.tag, elem.attrib)

Output:

pdml {'version': '0'}
field {'name': 'test', 'show': 'nice '}
d83dde0a {}

EDIT: The module xml.sax.saxutils offers only escape() and, for attributes, quoteattr(). XML parsers have problems with XML that is not well-formed. If you use an HTML parser or bs4, it can be handled:

import html
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        if tag == "field":
            print(attrs)
            print(html.escape(attrs[1][1]))


parser = MyHTMLParser()
parser.feed("""<pdml version="0">
  <field name="test" show="nice <d83dde0a>;" />
</pdml>""")

Output:

[('name', 'test'), ('show', 'nice <d83dde0a>;')]
nice &lt;d83dde0a&gt;;
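If you need to stay with xml.sax and a stream, one possible workaround is to sanitize the input line by line and hand it to the parser incrementally with feed() instead of parse(). This is only a sketch under assumptions that may not hold for your data: each attribute value is double-quoted, stays on one line, and contains no raw double quote. The names ATTR_VALUE, sanitize_line, parse_sanitized and AttrCollector are made up for this illustration.

```python
import io
import re
import xml.sax
import xml.sax.handler

# Assumption: values are double-quoted, single-line, with no raw '"' inside.
ATTR_VALUE = re.compile(r'="([^"]*)"')


def _fix(match):
    # Escape only '<' and '>' so existing entities (&amp;, &quot;) stay intact.
    value = match.group(1).replace("<", "&lt;").replace(">", "&gt;")
    return '="' + value + '"'


def sanitize_line(line):
    """Escape raw '<' and '>' inside double-quoted attribute values."""
    return ATTR_VALUE.sub(_fix, line)


def parse_sanitized(stream, handler):
    """Feed a text stream to xml.sax one sanitized line at a time."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    for line in stream:  # only one line is held in memory at a time
        parser.feed(sanitize_line(line))
    parser.close()


class AttrCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records each element and its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append((name, dict(attrs)))


fail_xml = '''<pdml version="0">
  <field name="test" show="nice <d83dde0a>;" />
</pdml>'''

collector = AttrCollector()
parse_sanitized(io.StringIO(fail_xml), collector)
for name, attrs in collector.seen:
    print(name, attrs)
```

For a subprocess pipe opened in binary mode, wrapping it as `io.TextIOWrapper(proc.stdout, encoding="utf-8")` (with `proc` being your Popen object) should give a line-iterable text stream. The regex is a heuristic, not an XML-aware fix, so check it against a sample of your real PDML output first.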
Hermann12
  • Thanks for your quick response. I can see in the output that "d83dde0a" is also treated as a tag, but it is content inside the field tag. I am looking for a solution using the xml.sax parser, since I use event handlers like startElement() and endElement() to process values on tag start and end events. Are there any options in the xml.sax parser to skip these special characters "< >" in tag content? In the xml.sax documentation I saw an escape() function for this, but I am not sure how to use it in my context. – chandra Jun 06 '23 at 05:40
  • Thanks for your support. My case is a little different: I think the HTML parser accepts string content for parsing. My requirement is to parse a huge amount of data, possibly around 100k packets. It is not practical to use an XML string as input; I use the stdout PIPE stream of a subprocess command for this, e.g. xml_parser.parse(data.stdout). Do you know of an XML parser that takes a stdout PIPE stream as input and handles this parsing error? – chandra Jun 07 '23 at 06:41