
I am trying to use the code below to download and read data from a file, but it goes OOM, exactly while reading the file. The S3 file is 22 MB; when I downloaded it through the browser it was 650 MB; but when I monitor it through VisualVM, the memory consumed while uncompressing and reading is more than 2 GB. Can anyone please guide me so that I can find the reason for the high memory usage? Thanks.

public static String unzip(InputStream in) throws IOException, CompressorException, ArchiveException {
            System.out.println("Unzipping.............");
            GZIPInputStream gzis = null;
            try {
                gzis = new GZIPInputStream(in);
                InputStreamReader reader = new InputStreamReader(gzis);
                BufferedReader br = new BufferedReader(reader);
                double mb = 0;
                String readed;
                int i=0;
                while ((readed = br.readLine()) != null) {
                     mb = mb+readed.getBytes().length / (1024*1024);
                     i++;
                     if(i%100==0) {System.out.println(mb);}
                }


            } catch (IOException e) {
                e.printStackTrace();
                LOG.error("Invoked AWSUtils getS3Content : json ", e);
            } finally {
                closeStreams(gzis, in);
            }
            return null;
}

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.kpmg.rrf.utils.AWSUtils.unzip(AWSUtils.java:917)

Monitoring: [VisualVM heap graph screenshots]

Aadam
  • Please [edit] your question to include the actual exception that you're getting, including the stacktrace. Indicate which line of the code that you posted is throwing the exception. – Kenster Nov 08 '19 at 14:33
  • Are you saying the unzipped file is 650 MB, and your VM uses 2 GB before running OOM? – AndyMan Nov 08 '19 at 14:46
  • Thank you Kenster, added more info to the question. – Aadam Nov 08 '19 at 14:46
  • @AndyMan The JVM uses more than 2 GB. To cross-check, I downloaded it through the browser from S3 (where the file size is 22 MB); after it downloaded it was around 650 MB on disk – Aadam Nov 08 '19 at 14:47
  • Is this the real code causing the problems? It seems like something would be missing from it, namely the code that actually does something with the data you are reading in. All you posted here is some logic which counts the number of megabytes. – Gimby Nov 08 '19 at 14:51
  • @Gimby I am sure this is the code that causes the OOM. Actually I was trying to convert it to a string; then I thought that might be causing the issue, so I added some code to check the size in MB. Anyhow, it is a huge JSON file without any newlines, so I was not able to check the size either – Aadam Nov 08 '19 at 14:59
  • Well then memory is leaking away somewhere outside of this code... – Gimby Nov 08 '19 at 15:03
  • @Gimby it crashed before it could finish reading; are you sure about that? – Aadam Nov 08 '19 at 15:08
  • Also I am sure that no other thread is running to consume the heap space. – Aadam Nov 08 '19 at 15:15
  • Ok guys, after modifying the code I can see that it takes 1600 MB to read the file; anyhow it goes OOM when I try to convert the byte array to a string. – Aadam Nov 08 '19 at 15:37

2 Answers


This is a theory, but I can't think of any other reasons why your example would OOM.

Suppose that the uncompressed file contains a very long line; e.g. something like 650 million ASCII bytes.

Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.

Internally, the readLine() method reads characters one at a time and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of a single very long line, then the StringBuffer is going to get very large.

  • Each text character in the uncompressed string becomes a char in the char[] that is the buffer part of the StringBuffer.

  • Each time the buffer fills up, StringBuffer will grow the buffer by (I think) doubling its size. This entails allocating a new char[] and copying the characters to it.

  • So if the buffer fills when there are N characters, Arrays.copyOf will allocate a char[] to hold 2 x N characters. And while the data is being copied, a total of 3 x N characters' worth of storage will be in use.

  • So 650MB could easily turn into a heap demand of > 6 x 650M bytes (each char takes 2 bytes, and 3 x N chars are live during the copy).

The other thing to note is that the 2 x N array has to be a single contiguous heap node.

Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.
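
To make that arithmetic concrete, here is a minimal sketch of the doubling pattern (a hypothetical demo class, not the real BufferedReader/StringBuffer internals). Run with a small heap such as -Xmx2g, it is likely to fail inside Arrays.copyOf, much like the stack trace in the question:

    import java.util.Arrays;

    public class LineGrowthDemo {
        public static void main(String[] args) {
            // Simulate a growing character buffer fed by one enormous "line".
            char[] buffer = new char[16];
            int length = 0;
            long totalChars = 650L * 1024 * 1024; // ~650 million characters, no newline

            for (long n = 0; n < totalChars; n++) {
                if (length == buffer.length) {
                    // Peak demand is here: the old buffer and the new (2x) buffer are
                    // both live, i.e. roughly 3 x N chars = 6 x N bytes during the copy.
                    buffer = Arrays.copyOf(buffer, buffer.length * 2);
                }
                buffer[length++] = 'x';
            }
        }
    }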


So what is the solution?

It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.

    public static String unzip(InputStream in) 
            throws IOException, CompressorException, ArchiveException {
        System.out.println("Unzipping.............");
        try (
            GZIPInputStream gzis = new GZIPInputStream(in);
            InputStreamReader reader = new InputStreamReader(gzis);
            BufferedReader br = new BufferedReader(reader);
        ) {
            int ch;
            long i = 0;
            while ((ch = br.read()) >= 0) {
                 i++;
                 if (i % (100 * 1024 * 1024) == 0) {
                     System.out.println(i / (1024 * 1024));
                 }
            }
        } catch (IOException e) {
            e.printStackTrace();
            LOG.error("Invoked AWSUtils getS3Content : json ", e);
        }
        return null;
    }
Stephen C
  • Altogether I need the whole string, not just the size; for that I have to use StringBuilder/Buffer, which in turn will again cause OOM – Aadam Nov 08 '19 at 16:24
  • That's right. You are going to have to rethink the way that you do this. 1) Don't use `readline`. 2) Don't store it as a `String`. 3) A `StringBuilder` may work if you preallocate it to a large enough size to hold the uncompressed text. 4) You may be able to save space by not converting to / storing as `char` data. – Stephen C Nov 13 '19 at 00:46
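
A hedged sketch of what suggestions 1, 2 and 4 from the comment above might look like in practice (the class and method names are made up for illustration, and the 700 MB preallocation is only a guess based on the ~650 MB uncompressed size mentioned in the question):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class UnzipToBytes {
        // Hypothetical helper: decompress the gzip stream into a byte[] without
        // readLine() or char[] storage. The initial capacity avoids repeated
        // growth (and copying) of the internal buffer.
        public static byte[] unzipToBytes(InputStream in) throws IOException {
            try (GZIPInputStream gzis = new GZIPInputStream(in)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream(700 * 1024 * 1024);
                byte[] chunk = new byte[8192];
                int n;
                while ((n = gzis.read(chunk)) >= 0) {
                    out.write(chunk, 0, n);
                }
                // toByteArray() makes one copy, so peak demand is roughly 2x the data
                // size; if that is still too much, process the chunks directly instead
                // of accumulating them.
                return out.toByteArray();
            }
        }
    }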

I also thought of the overly long line. On second thought, I think the StringBuffer that readLine() uses internally has to be converted to its result type: a String. Strings are immutable, but for speed reasons the JVM would not even check whether a line is a duplicate. So it may allocate the String many times, ultimately filling up the heap with String fragments that are no longer used.

My recommendation would be to read not lines or characters, but chunks of bytes. A byte[] is allocated on the heap and can be thrown away afterwards. Of course you would then count bytes instead of characters. Unless you know the difference and actually need characters, this could be the more stable and performant solution.

This code is just written from memory and not tested:

public static String unzip(InputStream in) 
            throws IOException, CompressorException, ArchiveException {
        System.out.println("Unzipping.............");
        try (
            GZIPInputStream gzis = new GZIPInputStream(in);
        ) {
            byte[] buffer = new byte[8192];
            long total = 0;
            long nextReport = 100L * 1024 * 1024; // print progress roughly every 100 MB
            int read;
            while ((read = gzis.read(buffer)) >= 0) {
                total += read;
                if (total >= nextReport) {
                    System.out.println(total / (1024 * 1024));
                    nextReport += 100L * 1024 * 1024;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
            LOG.error("Invoked AWSUtils getS3Content : json ", e);
        }
        return null;
    }
Queeg