3

I have pretty big XML documents, so I don't want to use DOM. While parsing a document with a SAX parser, I want to stop at some point (say, when I reach an element with a certain name) and get everything inside that element as a string. "Everything" inside is not necessarily a single text node; it may contain tags, but I don't want them to be parsed, I just want to get them as text.

I'm writing in Python. Is this possible? Thanks!

Fedor

4 Answers

1

Stopping a parse early does not seem to be offered by the xml.sax API, but you can interrupt control flow another way: with exceptions.

Just define a custom exception for that purpose:

class FinishedParsing(Exception):
    pass

Raise this exception in your handler once you have everything you need, then catch and ignore it around the parse call:

try:
    parser.parse(xml)
except FinishedParsing:
    pass
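
Put together, the idea might look like the sketch below. FinishedParsing is the exception defined above; StopAtElementHandler and text_of_first are hypothetical names chosen for illustration, and the handler collects only character data (nested tags are dropped), so treat it as a starting point rather than a complete solution.

```python
import io
import xml.sax


class FinishedParsing(Exception):
    """Raised by the handler to abort the parse early."""


class StopAtElementHandler(xml.sax.ContentHandler):
    """Collect the character data of the first <target> element, then stop."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.inside = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == self.target:
            self.inside = True

    def characters(self, content):
        # May be called several times for one run of text, so accumulate.
        if self.inside:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.target:
            # Nothing more is needed: unwind out of the parser.
            raise FinishedParsing


def text_of_first(source, target):
    """Return the text of the first <target> element in source."""
    handler = StopAtElementHandler(target)
    try:
        xml.sax.parse(source, handler)
    except FinishedParsing:
        pass
    return ''.join(handler.chunks)


if __name__ == "__main__":
    doc = b"<root><text>first</text><text>second</text></root>"
    print(text_of_first(io.BytesIO(doc), "text"))
```

Because the exception leaves the parser as soon as the closing tag is seen, the rest of a large document is never processed.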
Max
  • It's been a long time since I posted the question, but I think I ended up doing exactly this. – Fedor Feb 22 '21 at 12:08
1

I don't believe it's possible with xml.sax. BeautifulSoup has SoupStrainer, which lets you parse only the parts of a document you select. If you're open to using that library, it's quite easy to work with.

jeffknupp
  • This looks related to this question: http://stackoverflow.com/questions/4004979/using-soupstrainer-to-parse-selectively – Zach Young Jan 05 '12 at 15:53
  • Thank you for the answer. BeautifulSoup is a nice library; I'll consider using it if I can't find a way to do this with SAX. Actually, I'm surprised that "partial" parsing is not possible with SAX: you would simply need to raise a flag saying that everything inside the current element should be returned as text. – Fedor Jan 05 '12 at 17:46
0

Here is a hackish way to do this using SAX. It keeps only the character data found inside the <text> element, including text that belongs to nested elements; the nested tags and attributes themselves are discarded. It gets more complicated if you need to keep those tags and attributes as well.

from xml.sax import handler, make_parser

class CustomContentHandler(handler.ContentHandler):

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.inside_text_tag = False
        self.text_content = []

    def startElement(self, name, attrs):
        if name == 'text':
            # Start collecting; reset the buffer so each <text> is reported separately.
            self.inside_text_tag = True
            self.text_content = []

    def endElement(self, name):
        if name == 'text':
            self.inside_text_tag = False
            self.text = ''.join(self.text_content)
            print(self.text)

    def characters(self, content):
        # May be called several times for one run of text, so accumulate chunks.
        if self.inside_text_tag:
            self.text_content.append(content)

def parse_file(filename):
    parser = make_parser()
    parser.setContentHandler(CustomContentHandler())
    # Open in binary mode so the parser honours the encoding declared in the XML.
    with open(filename, 'rb') as f:
        parser.parse(f)

if __name__ == "__main__":
    filename = "sample.xml"
    parse_file(filename)

Used against the following sample.xml file:

<tag1>
  <tag2>
    <title>XML</title>
    <text>
      Text001
      <h1>Header</h1>
      Text002
      <b>Text003</b>
    </text>
  </tag2>
</tag1>

would yield (indentation and blank lines omitted)

Text001
Header
Text002
Text003
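
Keeping the nested tags and attributes as well can be sketched by turning each SAX event back into markup while inside the target element. InnerXMLHandler and inner_xml are hypothetical names, not part of xml.sax. The result is a re-serialization rather than the original bytes: entities come back escaped, attribute quoting may differ, and an empty element such as <br/> is written as <br></br>.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class InnerXMLHandler(xml.sax.ContentHandler):
    """Collect the markup inside every <target> element as text."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # 0 means we are outside the target element
        self.parts = []

    def startElement(self, name, attrs):
        if self.depth > 0:
            # Nested tag: write it back out as text.
            attr_text = ''.join(
                ' %s=%s' % (key, quoteattr(value))
                for key, value in attrs.items()
            )
            self.parts.append('<%s%s>' % (name, attr_text))
            self.depth += 1
        elif name == self.target:
            self.depth = 1

    def endElement(self, name):
        if self.depth == 0:
            return
        self.depth -= 1
        # depth == 0 here means this was the target's own closing tag.
        if self.depth > 0:
            self.parts.append('</%s>' % name)

    def characters(self, content):
        if self.depth > 0:
            self.parts.append(escape(content))


def inner_xml(source, target):
    """Return the re-serialized content of <target> elements in source."""
    handler = InnerXMLHandler(target)
    xml.sax.parse(source, handler)
    return ''.join(handler.parts)


if __name__ == "__main__":
    doc = b'<tag1><text>Text001<h1 class="x">Header</h1>Text002</text></tag1>'
    print(inner_xml(io.BytesIO(doc), "text"))
```

Tracking a depth counter instead of a boolean flag means a nested element that happens to share the target's name is treated as ordinary content.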
user635090
-1

That's what CDATA sections are for.

http://www.w3schools.com/xml/xml_cdata.asp

You could use libxml_saxlib to properly handle CDATA sections.

http://www.rexx.com/~dkuhlman/libxml_saxlib.html

UPDATE: As a strictly temporary solution, you can preprocess your input file to make it valid XML, for example by using 'sed' to insert CDATA tags in the appropriate places.

This does not solve the real problem, but it gives you a parsable XML file if you are lucky (e.g. if there are no surprises in the non-XML part of the file).
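
The same preprocessing step could be sketched in Python instead of sed. wrap_in_cdata is a hypothetical helper, and it rests on strong assumptions: the target element is never nested inside itself, its tags are written in the plain <name ...>...</name> form, and the whole document fits in memory as a string, which may not hold for very large files.

```python
import re


def wrap_in_cdata(xml_text, tag):
    """Wrap the content of every <tag>...</tag> element in a CDATA section."""
    name = re.escape(tag)
    pattern = re.compile(
        r'(<%s(?:\s[^>]*)?>)(.*?)(</%s>)' % (name, name),
        re.DOTALL,
    )

    def replace(match):
        # A literal "]]>" would end the CDATA section early, so split it
        # across two adjacent sections.
        body = match.group(2).replace(']]>', ']]]]><![CDATA[>')
        return '%s<![CDATA[%s]]>%s' % (match.group(1), body, match.group(3))

    return pattern.sub(replace, xml_text)


if __name__ == "__main__":
    print(wrap_in_cdata('<a><text>x<h1>H</h1></text></a>', 'text'))
```

After this step a SAX parser delivers the wrapped content to characters() as plain text, markup included.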

egbokul
  • I know what CDATA is, but my XML documents are what they are, and I need to parse them "partially". – Fedor Jan 05 '12 at 17:37
  • but then you are assuming the O.P. has control over the creation of the xml files, which does not look to be the case – jsbueno Jan 05 '12 at 17:54
  • My take on this is that you should fix the real problem, not conceal it. The real problem is the "XML" file (it is not even valid XML in this case), not the parser. Why reinvent the wheel if there are standardized, official ways of doing something? You should talk to whoever produces this "crap" and convince them to do it the right way. – egbokul Jan 06 '12 at 08:14