
I am currently evaluating EXI to compress large XML files. By "large" I mean an XML file of 20 GB (twenty).

Both EXI and non-EXI compression codecs (GZIP/LZMA) are integrated in a Scala application running on a Java virtual machine. GZIP and LZMA are provided by commons-compress. All codecs are implemented in Java in these third-party libraries.
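
For illustration, the non-EXI path is plain stream-to-stream compression, so its memory usage stays constant regardless of input size. A minimal sketch of that path (assuming commons-compress; the method name, file handling, and buffer size are illustrative, not the actual application code):

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream

// Streaming compression: memory usage is bounded by the 64 KiB buffer,
// independent of how large the input file is.
def gzipFile(in: String, out: String): Unit = {
  val src = new BufferedInputStream(new FileInputStream(in))
  val dst = new GzipCompressorOutputStream(new BufferedOutputStream(new FileOutputStream(out)))
  try {
    val buf = new Array[Byte](64 * 1024)
    Iterator.continually(src.read(buf)).takeWhile(_ != -1).foreach(n => dst.write(buf, 0, n))
  } finally {
    src.close()
    dst.close()
  }
}
```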

On a 64-bit Linux system with 8 GB of RAM (6 GB assigned to the JVM), both Exificient and OpenExi can encode, but fail to decode, when the original XML file is about 10 GB:

  • Exificient fails with an OutOfMemoryError
  • OpenExi fails with an ArrayIndexOutOfBoundsException: 1000000
  • No problems with GZIP/LZMA

  • Oracle JDK: 1.8.0 (8u40)

  • JVM args: -Xmx6g -XX:+UseG1GC -XX:+UseStringDeduplication

  • The resulting EXI-encoded XML file has a size of ~70 MB

My questions:

  • Does EXI imply (due to its underlying algorithm) that memory usage grows with the size of the XML input file? If so, is there a simple formula to calculate the required memory?
  • Is there anything one can do to make it work (other than assigning more memory)?

1 Answer


The EXI format defines options that restrict memory usage:

https://www.w3.org/TR/exi/#options

valueMaxLength and valuePartitionCapacity restrict, respectively, the length of entries in the EXI string table and their number.

For example, setting valueMaxLength to 16 means that no string longer than 16 characters is added to the table. String tables may grow during processing and must be kept in memory until the end.

The option valuePartitionCapacity restricts the number of strings in the table; once the capacity is reached, existing entries are replaced in round-robin fashion.

When EXI compression is used, also consider reducing blockSize.
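
As an illustration, here is roughly how these options could be set with Exificient's EXIFactory (a sketch assuming the com.siemens.ct.exi package layout of that era; the concrete numbers are hypothetical starting points to tune, not recommendations):

```scala
import com.siemens.ct.exi.{CodingMode, EXIFactory}
import com.siemens.ct.exi.helpers.DefaultEXIFactory

// All values below are placeholders; tune them for your data.
val factory: EXIFactory = DefaultEXIFactory.newInstance()
factory.setValueMaxLength(64)             // strings longer than 64 chars never enter the string table
factory.setValuePartitionCapacity(100000) // cap the table at 100k entries (round-robin replacement)
factory.setCodingMode(CodingMode.COMPRESSION)
factory.setBlockSize(100000)              // the spec default is 1,000,000; smaller blocks need less memory
```

Since these are header options defined by the EXI specification itself, OpenExi should expose equivalent settings through its own API.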

Hope this helps,

-- Daniel
