I am currently evaluating EXI to compress large XML files. Large here means a single XML file of 20 GB (twenty).
Both the EXI codecs and the non-EXI codecs (GZIP/LZMA) are integrated into a Scala application running on a Java virtual machine. GZIP and LZMA are provided by commons-compress; all codecs are implemented in Java by these third-party libraries.
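For reference, the non-EXI path is a plain streaming copy through the codec. A minimal sketch of what the GZIP round trip looks like (object and path names are placeholders, error handling trimmed):

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream, InputStream, OutputStream}
import org.apache.commons.compress.compressors.gzip.{GzipCompressorInputStream, GzipCompressorOutputStream}

object GzipRoundTrip {
  // Pumps bytes through in fixed-size chunks, so heap usage stays
  // constant no matter how large the XML file is.
  private def copy(in: InputStream, out: OutputStream): Unit = {
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n >= 0) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
  }

  def compress(xmlPath: String, gzPath: String): Unit = {
    val in  = new BufferedInputStream(new FileInputStream(xmlPath))
    val out = new GzipCompressorOutputStream(new BufferedOutputStream(new FileOutputStream(gzPath)))
    try copy(in, out) finally { in.close(); out.close() }
  }

  def decompress(gzPath: String, xmlPath: String): Unit = {
    val in  = new GzipCompressorInputStream(new BufferedInputStream(new FileInputStream(gzPath)))
    val out = new BufferedOutputStream(new FileOutputStream(xmlPath))
    try copy(in, out) finally { in.close(); out.close() }
  }
}
```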
On a 64-bit Linux system with 8 GB of RAM (6 GB of which are given to the JVM), both Exificient and OpenEXI can encode, but fail to decode, once the original XML file reaches about 10 GB:
- Exificient fails with an OutOfMemoryError
- OpenEXI fails with an ArrayIndexOutOfBoundsException: 1000000
GZIP and LZMA show no such problems with the same file. The failing EXI decode path is sketched below.
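The decode path follows the usual SAX/Transformer pattern from the Exificient documentation; a minimal sketch (Exificient 0.9.x package names assumed, paths are placeholders, exception handling trimmed):

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.sax.SAXSource
import javax.xml.transform.stream.StreamResult
import org.xml.sax.InputSource
import com.siemens.ct.exi.api.sax.EXISource          // Exificient 0.9.x package names
import com.siemens.ct.exi.helpers.DefaultEXIFactory

object ExiDecode {
  // Decodes an EXI stream back to XML text via Exificient's SAX API;
  // the OutOfMemoryError is raised while this transform is running.
  def decode(exiPath: String, xmlPath: String): Unit = {
    val exiFactory = DefaultEXIFactory.newInstance()  // default coding options
    val exiReader  = new EXISource(exiFactory).getXMLReader

    val saxSource = new SAXSource(new InputSource(new BufferedInputStream(new FileInputStream(exiPath))))
    saxSource.setXMLReader(exiReader)

    val out = new BufferedOutputStream(new FileOutputStream(xmlPath))
    try TransformerFactory.newInstance().newTransformer().transform(saxSource, new StreamResult(out))
    finally out.close()
  }
}
```

The OpenEXI wiring is analogous, just with its own SAX reader class.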
Oracle JDK: 1.8.0_40 (8u40)
JVM args:
-Xmx6g -XX:+UseG1GC -XX:+UseStringDeduplication
The resulting EXI-encoded file has a size of only ~70 MB.
My questions:
- Does EXI imply (due to its underlying algorithm) that memory usage grows with the size of the XML input file? If so, is there a simple formula to calculate the required memory?
- Is there anything one can do to make this work (other than assigning more memory)?