I have a problem with SAX in Java.
I'm parsing the DBLP digital library XML file (which lists journals, conferences, and papers). The XML file is very large (> 700MB).
My problem is that when the text of an element contains one or more entities, the string I get in the characters() callback only starts at the last entity found; everything before it is missing.
For example, the original author name held between the <author> tags is:
Rüdiger Mecke
(the ü is written as an entity in the XML file), but the result is:
üdiger Mecke
(this is the String built from the arguments of the characters(ch[], start, length) method).
I would like to know:
- how can I prevent the parser from automatically resolving entities?
- how can I solve the truncated characters problem described above?
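
For reference, here is a minimal sketch of a handler that may avoid the truncation. It assumes the cause is that the parser calls characters() more than once for a single <author> element, which SAX permits (contiguous text can be delivered in several chunks, for instance around an entity). The class name AuthorCollector and the helper parse() are hypothetical and only for illustration; the small inline DTD stands in for the real DBLP DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <author> element.
class AuthorCollector extends DefaultHandler {
    // Buffer that accumulates all chunks delivered for the current element.
    private final StringBuilder text = new StringBuilder();
    private final List<String> authors = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("author".equals(qName)) {
            text.setLength(0); // start a fresh buffer for this author
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append instead of overwrite.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("author".equals(qName)) {
            authors.add(text.toString()); // the complete name is known only here
        }
    }

    List<String> getAuthors() {
        return authors;
    }

    // Convenience helper (assumption, not part of SAX): parse an XML string.
    static List<String> parse(String xml) {
        try {
            AuthorCollector handler = new AuthorCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getAuthors();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }
}
```

With this approach the author name is read in endElement() rather than in characters(), so it does not matter how many chunks the parser delivers. For the real file, the same handler would be passed to the parser together with the file instead of a string.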