
I'm working on implementing GZIP compression for interactions between some of our systems. The systems are written in both Java and C#, so we used GZIP streams on both sides since both languages support them in their standard libraries.

On the C# side, everything works up to and including our biggest test files (70 MB uncompressed). On the Java side, however, we run out of heap space. We've tried raising the JVM heap size as high as the IDE allows, but the issue is still not resolved.

I've taken some steps to try to optimize the Java code, but nothing seems to keep the data from piling up in the heap. Is there a good way to handle this? Below is a subset of my current solution (it works on smaller streams).

EDIT: The following code was modified with recommendations from @MarkoTopolnik. With the changes, about 17 million characters are read before the crash.

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.zip.GZIPInputStream;

public static String decompress(byte[] compressed, int size) throws IOException
{
    // Cap the working buffer at 2 KB; smaller payloads get a buffer of their own size.
    char[] buf = new char[(size < 2048) ? size : 2048];
    Writer ret = new StringWriter(buf.length);

    GZIPInputStream decompresser = new GZIPInputStream(new ByteArrayInputStream(compressed), buf.length);
    BufferedReader reader = new BufferedReader(new InputStreamReader(decompresser, "UTF-8"));

    // Copy decoded characters into the StringWriter in buffer-sized chunks.
    int charsRead;
    while ((charsRead = reader.read(buf, 0, buf.length)) != -1)
    {
        ret.write(buf, 0, charsRead);
    }
    // Closing the reader also closes the wrapped GZIP and byte-array streams.
    reader.close();

    return ret.toString();
}

The original (pre-edit) code died after a little over 7.6 million chars in the ArrayList, and the stack trace pointed at the ArrayList.add() call (it failed while triggering an expansion of the internal array).

With the edited code above, a call to AbstractStringBuilder.expandCapacity() is what kills the program.

Is there a less memory-expensive way to implement a dynamic array or some completely different approach I can use to get a String from the decompressed stream? Any suggestions would be greatly appreciated!

Dan

2 Answers


I'd chunk it rather than reading the whole thing into memory: read a 1024-byte buffer at a time and immediately write it out, more like a Unix pipe than a two-step read/write process.
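
A minimal sketch of that idea, assuming the caller could be changed to accept an OutputStream instead of expecting a String back (the decompressTo name is hypothetical):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical streaming variant: bytes flow straight from the GZIP stream
// to the destination, so heap use stays bounded by the 1 KB buffer.
public static void decompressTo(byte[] compressed, OutputStream out) throws IOException
{
    GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
    byte[] buf = new byte[1024];
    int bytesRead;
    while ((bytesRead = in.read(buf)) != -1)
    {
        out.write(buf, 0, bytesRead);
    }
    in.close();
}

Whether a change like this is feasible depends on the surrounding framework, which is exactly the constraint discussed in the comments below.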

duffymo
    This is often not an option due to existing framework constraints. I'd say OP has such a case. – Marko Topolnik May 30 '13 at 19:28
  • Yes, I can't see a way to implement that with our pre-existing framework. The function must return a String and must decompress the given byte array. With those constraints I can't see a way to implement your solution (correct me if I'm wrong). – Dan May 30 '13 at 19:47
  • Sounds like you're right. I'd recommend using VisualVM, with all plugins installed, to see where memory is being consumed. It might be time to ditch your framework. – duffymo May 30 '13 at 19:54

Oh yes, there are far more efficient ways. The most glaring inefficiency in your code is that you create an ArrayList<Character>. Each boxed character costs roughly 30 bytes (a Character object plus a reference slot in the list's backing array), so your 7.6 million characters add up to around 230 MB.

What you should use instead is a StringWriter and its write(char[], int, int) method, which you can call with the same buffer you already have. Since the characters then live in a single char[] at 2 bytes each, this is roughly 15 times more memory-efficient.
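
For contrast, here is a minimal sketch of both approaches side by side; the boxed version is reconstructed from the question's description of the pre-edit code, and the Reader parameter stands in for the BufferedReader over the GZIP stream:

import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Boxed version: every char becomes a Character object plus a reference slot
// in the ArrayList's backing array -- roughly 30 bytes per character.
static String viaList(Reader reader) throws IOException
{
    List<Character> chars = new ArrayList<Character>();
    int c;
    while ((c = reader.read()) != -1)
    {
        chars.add((char) c);
    }
    StringBuilder sb = new StringBuilder(chars.size());
    for (char ch : chars)
    {
        sb.append(ch);
    }
    return sb.toString();
}

// StringWriter version: the characters accumulate in a single char[] at
// 2 bytes each, with no per-character object overhead.
static String viaWriter(Reader reader) throws IOException
{
    char[] buf = new char[2048];
    StringWriter out = new StringWriter(buf.length);
    int n;
    while ((n = reader.read(buf, 0, buf.length)) != -1)
    {
        out.write(buf, 0, n);
    }
    return out.toString();
}

Pre-sizing the StringWriter with an expected character count would also reduce the expandCapacity() churn mentioned in the question, although the whole String must still fit in the heap at the end.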

Marko Topolnik
  • Thanks for that! Got me 10 million more characters before heap was exceeded. Doesn't completely solve the problem, but a good start. – Dan May 30 '13 at 19:37
  • That's surprisingly little. I think it is because I didn't account for `Character` caching. – Marko Topolnik May 30 '13 at 19:39