I'm having trouble parsing XML files with SAX: the text of an element comes back truncated.

The Java ContentHandler code looks like this:

boolean rcontent = false;

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    if (qName.equalsIgnoreCase("content")) {
        rcontent = true;
    }
}

@Override
public void characters(char ch[], int start, int length) throws SAXException {
    if (rcontent){
        System.out.println("content: " + new String(ch, start, length));
        rcontent = false;
    }
}

The XML file content looks like this: [image: XML file content]

But the output is:

I want to say

which is not complete.

Nathan Hughes

1 Answer

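It's likely that characters(...) is being called multiple times for the single <content> element, so printing on the first call shows only the first chunk. Try accumulating the text and printing it in endElement(...), something like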

StringBuilder builder;

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    if (qName.equalsIgnoreCase("content")) {
        builder = new StringBuilder();
    }
}

@Override
public void characters(char ch[], int start, int length) throws SAXException {
    if (builder != null){
        builder.append(ch, start, length);
    }
}

@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
    // endElement takes no Attributes parameter; flush only when <content> itself closes
    if (builder != null && qName.equalsIgnoreCase("content")) {
        System.out.println("Content = " + builder);
        builder = null;
    }
}
lance-java
  • Cheers! It works very well! But why is characters(...) called multiple times for a single tag? Is it because it encounters &amp; or because there is a maximum size? – Jintao Wang Aug 30 '17 at 13:31
  • Read the [javadocs](https://docs.oracle.com/javase/7/docs/api/org/xml/sax/helpers/DefaultHandler.html#characters(char[],%20int,%20int)), which state that the character data is "chunked". This is usually done to avoid holding large character arrays in memory unnecessarily. I'm guessing different SAX parsers could choose to chunk the characters differently, so you shouldn't rely on any particular chunking behaviour. – lance-java Aug 30 '17 at 13:42
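
For anyone who wants to try the accumulate-then-flush approach end to end, here is a minimal, self-contained sketch. The class names ContentExtractor and ContentCollector, the extract helper, and the sample XML string are all made up for illustration (the original XML was only posted as an image), and the JDK's built-in SAX parser is assumed. The collector gathers every characters(...) call into a StringBuilder and stores the result only when the closing </content> tag arrives, so it does not matter how the parser splits the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ContentExtractor {

    // Hypothetical handler: collects the complete text of each <content> element.
    static class ContentCollector extends DefaultHandler {
        final List<String> contents = new ArrayList<>();
        private StringBuilder builder; // non-null only while inside <content>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equalsIgnoreCase("content")) {
                builder = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element; append every chunk.
            if (builder != null) {
                builder.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Flush only when the <content> element itself closes.
            if (builder != null && qName.equalsIgnoreCase("content")) {
                contents.add(builder.toString());
                builder = null;
            }
        }
    }

    // Parses the given XML string and returns the text of all <content> elements.
    static List<String> extract(String xml) throws Exception {
        ContentCollector collector = new ContentCollector();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return collector.contents;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input containing an entity reference.
        String xml = "<doc><content>I want to say &amp; more</content></doc>";
        System.out.println(extract(xml)); // prints [I want to say & more]
    }
}
```

How many times characters(...) fires for this input depends on the parser, which is exactly why the text is assembled in endElement rather than printed from the first chunk.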