Can I somehow tell a SAX parser to stop at some element and get its child nodes as a string?

Question

I have pretty big XML documents, so I don't want to use DOM. While parsing a document with a SAX parser, I want to stop at some point (say, when I reach an element with a certain name) and get everything inside that element as a string. "Everything" inside is not necessarily a text node; it may contain tags, but I don't want them to be parsed, I just want to get them as text.

I'm writing in Python. Is this possible? Thanks!

Max · Accepted Answer · 2021-02-20T15:10:17.630

1

It does not seem to be offered by the xml.sax API, but you can utilize another way of interrupting control flow: exceptions.

Just define a custom exception for that purpose:

class FinishedParsing(Exception):
    pass

Raise this exception in your handler once you have everything you need, then catch and ignore it around the call to parse():

try:
    parser.parse(xml)
except FinishedParsing:
    pass
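
Putting the two pieces together, a minimal sketch might look like the following. The handler name StopAtElementHandler, the helper first_element_text, and the sample document are made up for illustration; only the idea of raising an exception from the handler comes from the answer. Note that this collects character data only, so nested tags are not preserved.

```python
import io
import xml.sax


class FinishedParsing(Exception):
    """Raised by the handler to abort parsing early."""


class StopAtElementHandler(xml.sax.ContentHandler):
    # Hypothetical handler: collects the character data of the first
    # `target` element, then aborts the parse.
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # nesting depth inside the target element
        self.chunks = []

    def startElement(self, name, attrs):
        if self.depth or name == self.target:
            self.depth += 1

    def characters(self, content):
        if self.depth:
            self.chunks.append(content)

    def endElement(self, name):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                # The target element is closed: stop reading the document.
                raise FinishedParsing


def first_element_text(source, target):
    handler = StopAtElementHandler(target)
    try:
        xml.sax.parse(source, handler)
    except FinishedParsing:
        pass  # expected: this is how the handler signals "done"
    return "".join(handler.chunks)


doc = b"<root><title>XML</title><text>a<b>b</b>c</text></root>"
print(first_element_text(io.BytesIO(doc), "title"))
```

Because the exception propagates out of parse(), the rest of the document is never read, which is the point when the files are large.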

edited Feb 20 '21 at 15:10

answered Feb 20 '21 at 12:45

It's been a long time since I posted the question, but I think I ended up doing exactly this. – Fedor Feb 22 '21 at 12:08

score 1 · Answer 2 · answered Jan 05 '12 at 15:25

I don't believe it's possible with xml.sax. BeautifulSoup has SoupStrainer, which lets you parse only the parts of a document you are interested in. If you're open to using the library, it's quite easy to work with.

jeffknupp

This looks related to this question: http://stackoverflow.com/questions/4004979/using-soupstrainer-to-parse-selectively – Zach Young Jan 05 '12 at 15:53
Thank you for the answer. BeautifulSoup is a nice library, and I'll consider using it if I can't find a way to do this with SAX. Actually, I'm surprised that "partial" parsing is not possible with SAX; you would simply need to raise a flag saying that everything inside the current element should be returned as text. – Fedor Jan 05 '12 at 17:46

score 0 · Answer 3 · answered Aug 26 '12 at 19:45

Here is a hackish way to do this, using SAX. This would keep the contents inside your text nodes. It gets more complicated if you need to keep the tags and attributes inside those text nodes though.

from xml.sax import handler, make_parser

class CustomContentHandler(handler.ContentHandler):

    def __init__(self):
        super().__init__()
        self.inside_text_tag = False
        self.text_content = []

    def startElement(self, name, attrs):
        if name == 'text':
            self.inside_text_tag = True
            self.text_content = []  # start fresh for each <text> element

    def endElement(self, name):
        if name == 'text':
            self.inside_text_tag = False
            self.text = ''.join(self.text_content)
            print(self.text)

    def characters(self, content):
        # Only character data is collected; nested tags are dropped.
        if self.inside_text_tag:
            self.text_content.append(content)

def parse_file(filename):
    parser = make_parser()
    parser.setContentHandler(CustomContentHandler())
    # Binary mode lets the parser honor the encoding declared in the XML.
    with open(filename, 'rb') as f:
        parser.parse(f)

if __name__ == "__main__":
    parse_file("sample.xml")

Used against the following sample.xml file:

<tag1>
  <tag2>
    <title>XML</title>
    <text>
      Text001
      <h1>Header</h1>
      Text002
      <b>Text003</b>
    </text>
  </tag2>
</tag1>

would yield (indentation and blank lines omitted)

Text001
Header
Text002
Text003
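
If the tags and attributes inside the element are needed as well, one possible approach is to forward the SAX events that occur inside the target element to xml.sax.saxutils.XMLGenerator, which writes them back out as markup. This is only a sketch: the names InnerXMLHandler and inner_xml are hypothetical, and the result is a re-serialization of the parsed events rather than the original bytes, so details such as attribute quoting, comments, and CDATA boundaries are not preserved.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class InnerXMLHandler(xml.sax.ContentHandler):
    # Hypothetical handler: rebuilds the markup found inside each
    # `target` element by replaying SAX events into an XMLGenerator.
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0          # 0 means "outside any target element"
        self.buffer = None
        self.writer = None
        self.fragments = []     # one string per target element

    def startElement(self, name, attrs):
        if self.depth:
            # Nested element inside the target: write it back as markup.
            self.writer.startElement(name, attrs)
            self.depth += 1
        elif name == self.target:
            self.buffer = io.StringIO()
            self.writer = XMLGenerator(self.buffer, encoding="utf-8")
            self.depth = 1

    def characters(self, content):
        if self.depth:
            self.writer.characters(content)  # re-escapes &, < and >

    def endElement(self, name):
        if not self.depth:
            return
        self.depth -= 1
        if self.depth:
            self.writer.endElement(name)
        else:
            # The target element itself is closed: keep what was written.
            self.fragments.append(self.buffer.getvalue())


def inner_xml(data, target):
    handler = InnerXMLHandler(target)
    xml.sax.parseString(data, handler)
    return handler.fragments


sample = b'<root><text>a <b class="x">bold</b> &amp; c</text></root>'
print(inner_xml(sample, "text"))
```

The nested content is still parsed by expat here, so it has to be well-formed; the handler only turns the events back into a string.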

egbokul · Answer 4 · 2012-01-06T08:43:21.540

-1

That's what CDATA sections are for.

http://www.w3schools.com/xml/xml_cdata.asp

You could use libxml_saxlib to properly handle CDATA sections.

http://www.rexx.com/~dkuhlman/libxml_saxlib.html

UPDATE: As a strictly temporary solution, you can preprocess your input file to make it valid XML. Use sed, for example, to insert CDATA tags in the appropriate places.

This does not solve the real problem, but it gives you a parsable XML file if you are lucky (e.g. if there are no surprises in the non-XML part of the file).
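
Since the question is about Python, the same preprocessing idea can be sketched with the re module instead of sed. Everything here is an assumption made for illustration: the helper wrap_in_cdata, the TextCollector handler, and the sample input are hypothetical, the target element must not be nested inside itself or be self-closing, and its body must not contain a literal "]]>". The sketch also holds the whole input in memory, which may not suit very large files.

```python
import re
import xml.sax


def wrap_in_cdata(raw, tag):
    # Hypothetical helper: wrap the body of every <tag>...</tag> in a
    # CDATA section so the parser treats it as plain character data.
    # Assumes no nested or self-closing <tag> and no "]]>" in the body.
    name = re.escape(tag)
    pattern = re.compile(
        r"(<" + name + r"(?:\s[^>]*)?>)(.*?)(</" + name + r"\s*>)",
        re.DOTALL,
    )
    return pattern.sub(
        lambda m: m.group(1) + "<![CDATA[" + m.group(2) + "]]>" + m.group(3),
        raw,
    )


class TextCollector(xml.sax.ContentHandler):
    # Collects the character data of each `tag` element as one string.
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.chunks = []
        self.results = []

    def startElement(self, name, attrs):
        if name == self.tag:
            self.inside = True
            self.chunks = []

    def characters(self, content):
        if self.inside:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.tag:
            self.inside = False
            self.results.append("".join(self.chunks))


raw = "<root><text>a <b>bold</b> & <br> c</text></root>"
wrapped = wrap_in_cdata(raw, "text")
collector = TextCollector("text")
xml.sax.parseString(wrapped.encode("utf-8"), collector)
print(collector.results)
```

After wrapping, the inner markup reaches characters() untouched, including content that would not be well-formed XML on its own.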

edited Jan 06 '12 at 08:43

answered Jan 05 '12 at 16:28

I know what CDATA is, but my XML documents are what they are, and I need to parse them "partially". – Fedor Jan 05 '12 at 17:37
But then you are assuming the O.P. has control over the creation of the XML files, which does not look to be the case. – jsbueno Jan 05 '12 at 17:54
My take on this is that you should fix the real problem, not conceal it. The real problem is the "XML" file (it is not even valid XML in this case), not the parser. Why reinvent the wheel if there are standardized, official ways of doing something? You should talk to whoever produces this "crap" and convince them to do it the right way. – egbokul Jan 06 '12 at 08:14
