
Scenario: I'm receiving a huge XML file over an extremely slow network, so I want to start the expensive processing as early as possible. That's why I decided to use SAXParser.

I expected to get an event as soon as a tag is finished.

The following test shows what I mean:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.junit.Test;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

@Test
public void sax_parser_read_much_things_before_returning_events() throws Exception{
    final String xml = "<a>"
               + "  <b>..</b>"
               + "  <c>..</c>"
                  // much more ...
               + "</a>";

    // wrapper to show what is read
    InputStream is = new InputStream() {
        InputStream is = new ByteArrayInputStream(xml.getBytes());

        @Override
        public int read() throws IOException {
            int val = is.read();
            System.out.print((char) val);
            return val;
        }
    };

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(is, new DefaultHandler(){
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            System.out.print("\nHandler start: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            System.out.print("\nHandler end: " + qName);
        }
    });
}

I wrapped the input stream to see what is read and when the events occur.

What I expected was something like this:

<a>                    <- output from read()
Handler start: a
<b>                    <- output from read()
Handler start: b
</b>                   <- output from read()
Handler end: b
...

Sadly, the result was the following:

<a>  <b>..</b>  <c>..</c></a>        <- output from read()
Handler start: a
Handler start: b
Handler end: b
Handler start: c
Handler end: c
Handler end: a

Where is my mistake and how can I get the expected result?

Edit:

  • First, the parser tries to detect the document version, which causes it to scan everything. With a document version present it breaks off in between (but not where I expect).
  • It is not OK that the parser "wants to" read, say, 1000 bytes and blocks until it gets them, because the stream may not contain that much data at that point in time.
  • I found the buffer sizes in XMLEntityManager (see the sketch after this list):
    • public static final int DEFAULT_BUFFER_SIZE = 8192;
    • public static final int DEFAULT_XMLDECL_BUFFER_SIZE = 64;
    • public static final int DEFAULT_INTERNAL_BUFFER_SIZE = 1024;
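
To make those chunk sizes visible, here is a minimal sketch (my illustration, not part of the original post): a wrapper that logs the length of each bulk read the parser issues. With the JDK's Xerces-based parser I would expect an initial small request while it scans for the XML declaration, followed by larger ones, though the exact sizes are implementation details:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: logs how many bytes the parser requests per bulk read.
class ChunkLoggingInputStream extends InputStream {
    private final InputStream delegate;

    ChunkLoggingInputStream(InputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public int read() throws IOException {
        return delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = delegate.read(b, off, len);
        System.out.println("parser requested " + len + " bytes, got " + n);
        return n;
    }
}

Passing new ChunkLoggingInputStream(new ByteArrayInputStream(xml.getBytes())) to parser.parse(...) instead of the plain stream prints the requested lengths.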
Marcel
    I think you should try a bigger test file - I suspect that a buffered read is effectively reading your entire file before it starts processing, because it would buffer the file in (say) 1k chunks or whatever - if you use a large file you may get something more like what you expect. – Elemental Oct 20 '15 at 10:40

2 Answers


It seems you are making wrong assumptions about how the I/O works. An XML parser, like most software, will request data in chunks, because requesting single bytes from a stream is a recipe for a performance disaster.

This does not imply that the buffer must get completely filled before a read attempt returns. It's just that a ByteArrayInputStream is incapable of emulating the behavior of a network InputStream. You can easily fix that by overriding read(byte[], int, int) and returning not a complete buffer but, e.g., a single byte on every request:

@Test
public void sax_parser_read_much_things_before_returning_events() throws Exception{
    final String xml = "<a>"
               + "  <b>..</b>"
               + "  <c>..</c>"
                  // much more ...
               + "</a>";

    // wrapper to show what is read
    InputStream is = new InputStream() {
        InputStream is = new ByteArrayInputStream(xml.getBytes());

        @Override
        public int read() throws IOException {
            int val = is.read();
            System.out.print((char) val);
            return val;
        }
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
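            // deliver at most one byte per call, emulating a slow network stream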
            return super.read(b, off, 1);
        }
    };

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(is, new DefaultHandler(){
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            System.out.print("\nHandler start: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            System.out.print("\nHandler end: " + qName);
        }
    });
}

This will print

<a>  
Handler start: a<b>
Handler start: b..</b>
Handler end: b  <c>
Handler start: c..</c>
Handler end: c</a>
Handler end: a?

showing how the XML parser adapts to the availability of data from the InputStream.
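
As a variation (my own sketch, not part of the answer), the override can hand back small random-sized chunks instead of exactly one byte, which mimics bursty network arrival a bit more closely:

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // hand back 1..4 bytes per call to mimic bursty arrival
            int chunk = java.util.concurrent.ThreadLocalRandom.current().nextInt(1, 5);
            return super.read(b, off, Math.min(chunk, len));
        }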

Holger

Internally, the SAX parser has most probably wrapped your InputStream in a BufferedReader or uses some sort of buffering. Otherwise it would read single bytes from the input, which would really hurt performance.

So what you are seeing is that the parser reads a chunk from the input and then processes that part, issuing the SAX events, and so on...
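
A minimal sketch (my illustration, assuming a BufferedInputStream with its default 8 KiB buffer) of how such buffering turns many tiny consumer reads into large bulk requests on the underlying stream:

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferingDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = "<a><b>..</b><c>..</c></a>".getBytes();

        InputStream raw = new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                // log what the buffering layer requests from the underlying stream
                System.out.println("underlying read asked for " + len + " bytes");
                return super.read(b, off, len);
            }
        };

        // default buffer size is 8192 bytes
        InputStream buffered = new BufferedInputStream(raw);

        // the consumer reads one byte at a time, but the underlying stream
        // only sees bulk requests of up to 8192 bytes
        while (buffered.read() != -1) { /* consume */ }
    }
}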

wero