3

I have a problem with SAX in Java.

I'm parsing the DBLP digital library database XML file (which lists journals, conferences, and papers). The XML file is very large (> 700MB).

My problem is that when the text of an element contains several entities, the string I get from the characters() callback holds only the part starting at the last entity, not the whole text.

For example, R&uuml;diger Mecke is the original author name held between <author> tags, and

üdiger Mecke is the result

(This is the String built from the arguments of the characters(char[] ch, int start, int length) method.)

I would like to know:

  1. How can I prevent the parser from automatically resolving entities?
  2. How can I solve the truncated-characters problem described above?
McDowell
user278064

2 Answers

4

characters() is not guaranteed to return all of the characters in a single call. From the Javadoc:

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks.

You need to append the characters returned in all of the calls, something like:

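import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollectingHandler extends DefaultHandler {
    // Accumulates text across multiple characters() calls
    private final StringBuilder tempValue = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        tempValue.setLength(0); // clear buffer...
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        tempValue.append(ch, start, length); // append to buffer
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = tempValue.toString(); // use characters in buffer...
    }
}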
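
To see the chunking for yourself, here is a minimal, self-contained sketch of the same buffering idea applied to the author name from the question. The class and member names (AuthorTextDemo, AuthorHandler, chunks) are made up for illustration, and the inline DOCTYPE that declares &uuml; is an assumption so the snippet runs without the real DBLP DTD. How the text is split into chunks depends on the parser implementation, so the sketch only relies on the joined result.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AuthorTextDemo {

    /** Hypothetical handler: collects the full text of every <author> element. */
    static class AuthorHandler extends DefaultHandler {
        final List<String> authors = new ArrayList<>();
        final List<String> chunks = new ArrayList<>(); // raw chunks, kept only for inspection
        private final StringBuilder text = new StringBuilder();
        private boolean inAuthor = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("author".equals(qName)) {
                inAuthor = true;
                text.setLength(0); // start a fresh buffer for this element
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inAuthor) {
                chunks.add(new String(ch, start, length)); // one entry per callback
                text.append(ch, start, length);            // accumulate across callbacks
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("author".equals(qName)) {
                authors.add(text.toString()); // the complete text is only known here
                inAuthor = false;
            }
        }
    }

    /** Parses the given XML string with the default JAXP SAX parser. */
    static AuthorHandler parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        AuthorHandler handler = new AuthorHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Assumed stand-in for the DBLP file: the entity is declared inline.
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>\n"
            + "<dblp><article><author>R&uuml;diger Mecke</author></article></dblp>";
        AuthorHandler h = parse(xml);
        System.out.println("chunks: " + h.chunks);
        System.out.println("author: " + h.authors.get(0));
    }
}
```

If the chunks list shows more than one entry for a single author, that is the behaviour described in the question: reading only the last callback loses the beginning of the name, while the buffer always holds the whole text.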
Quentin Pradet
  • What determines the chunks that the parser returns? My file contains a ", and that seems to delimit parsing. – Lord Cat Jul 18 '16 at 09:54
2
  1. I don't think you can turn off entity resolution.

  2. The characters method can be called multiple times for a single tag, and you have to collect the characters across the multiple calls rather than expecting them all to arrive at once.

Don Roby
  • OK, but why is the characters method called multiple times only for text nodes holding entities? – user278064 Dec 29 '10 at 17:32
  • I don't believe that's the only thing that will cause multiple calls, but I know it does in most implementations of SAX. Long blocks might also be split. – Don Roby Dec 29 '10 at 17:36