
I'm working on implementing GZIP compression for interactions between some of our systems. The systems are written in both Java and C#, so we used GZIP streams on both sides since both languages support them in their standard libraries.

On the C# side, everything works up to and including our biggest test files (70 MB uncompressed). On the Java side, however, we run out of heap space. We've tried raising the JVM heap size as high as the IDE allows, but the issue is still not resolved.

I've taken some steps to try to optimize the Java code, but nothing seems to keep the data from piling up in the heap. Is there a good way to handle this? Below is a subset of my current solution (it works on smaller streams).

EDIT: The following code was modified with recommendations from @MarkoTopolnik. With the changes, about 17 million characters are read before the crash.

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.zip.GZIPInputStream;

public static String decompress(byte[] compressed, int size) throws IOException
{
    // Cap the working buffer at 2 KB; smaller payloads get a buffer of their own size.
    char[] buf = new char[(size < 2048) ? size : 2048];
    Writer ret = new StringWriter(buf.length);

    GZIPInputStream decompresser = new GZIPInputStream(new ByteArrayInputStream(compressed), buf.length);
    BufferedReader reader = new BufferedReader(new InputStreamReader(decompresser, "UTF-8"));

    // Copy decoded characters into the StringWriter in buffer-sized chunks.
    int charsRead;
    while ((charsRead = reader.read(buf, 0, buf.length)) != -1)
    {
        ret.write(buf, 0, charsRead);
    }
    // Closing the reader also closes the wrapped GZIP and byte-array streams.
    reader.close();

    return ret.toString();
}

The original (pre-edit) code died after a little over 7.6 million chars in the ArrayList, and the stack trace pointed at the ArrayList.add() call (it failed while triggering an expansion of the internal array).

With the edited code above, a call to AbstractStringBuilder.expandCapacity() is what kills the program.

Is there a less memory-expensive way to implement a dynamic array or some completely different approach I can use to get a String from the decompressed stream? Any suggestions would be greatly appreciated!

Dan

2 Answers


I'd chunk it rather than reading the whole thing into memory: read a 1024-byte buffer at a time and immediately write it out, more like a Unix pipe than a two-step read/write process.
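
A minimal sketch of that idea, assuming the caller could be changed to accept an OutputStream instead of expecting a String back (the decompressTo name is hypothetical):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical streaming variant: bytes flow straight from the GZIP stream
// to the destination, so heap use stays bounded by the 1 KB buffer.
public static void decompressTo(byte[] compressed, OutputStream out) throws IOException
{
    GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
    byte[] buf = new byte[1024];
    int bytesRead;
    while ((bytesRead = in.read(buf)) != -1)
    {
        out.write(buf, 0, bytesRead);
    }
    in.close();
}

Whether a change like this is feasible depends on the surrounding framework, which is exactly the constraint discussed in the comments below.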

duffymo
    This is often not an option due to existing framework constraints. I'd say OP has such a case. – Marko Topolnik May 30 '13 at 19:28
  • Yes, I can't see a way to implement that with our pre-existing framework. The function must return a String and must decompress the given byte array. With those constraints I can't see a way to implement your solution (correct me if I'm wrong). – Dan May 30 '13 at 19:47
  • Sounds like you're right. I'd recommend using VisualVM, with all plugins installed, to see where memory is being consumed. It might be time to ditch your framework. – duffymo May 30 '13 at 19:54

Oh yes, there are far more efficient ways. The most glaring inefficiency in your code is that you create an ArrayList<Character>. Each boxed character costs roughly 30 bytes (a Character object plus a reference slot in the list's backing array), so your 7.6 million characters add up to around 230 MB.

What you should use instead is a StringWriter and its write(char[], int, int) method, which you can call with the same buffer you already have. Since the characters then live in a single char[] at 2 bytes each, this is roughly 15 times more memory-efficient.
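
For contrast, here is a minimal sketch of both approaches side by side; the boxed version is reconstructed from the question's description of the pre-edit code, and the Reader parameter stands in for the BufferedReader over the GZIP stream:

import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Boxed version: every char becomes a Character object plus a reference slot
// in the ArrayList's backing array -- roughly 30 bytes per character.
static String viaList(Reader reader) throws IOException
{
    List<Character> chars = new ArrayList<Character>();
    int c;
    while ((c = reader.read()) != -1)
    {
        chars.add((char) c);
    }
    StringBuilder sb = new StringBuilder(chars.size());
    for (char ch : chars)
    {
        sb.append(ch);
    }
    return sb.toString();
}

// StringWriter version: the characters accumulate in a single char[] at
// 2 bytes each, with no per-character object overhead.
static String viaWriter(Reader reader) throws IOException
{
    char[] buf = new char[2048];
    StringWriter out = new StringWriter(buf.length);
    int n;
    while ((n = reader.read(buf, 0, buf.length)) != -1)
    {
        out.write(buf, 0, n);
    }
    return out.toString();
}

Pre-sizing the StringWriter with an expected character count would also reduce the expandCapacity() churn mentioned in the question, although the whole String must still fit in the heap at the end.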

Marko Topolnik
  • Thanks for that! Got me 10 million more characters before heap was exceeded. Doesn't completely solve the problem, but a good start. – Dan May 30 '13 at 19:37
  • That's surprisingly little. I think it is because I didn't account for `Character` caching. – Marko Topolnik May 30 '13 at 19:39