
I need to process large gzip-compressed text files.

InputStream is = new GZIPInputStream(new FileInputStream(path));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
    someComputation();  
}

This code works as long as I don't do any long computation inside the loop (which I have to). But adding even a few milliseconds of sleep per line eventually makes the program crash with a java.util.zip.ZipException. The exception's message is different every time ("invalid literal/length code", "invalid block type", "invalid stored block lengths").
So it seems that the stream becomes corrupted when I don't read it quickly enough.
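For reference, here is a self-contained variant of the local-file snippet, with try-with-resources and a BufferedInputStream between the file and the gzip decoder (a tiny temporary .gz file is created first so the sketch runs on its own; it shows the same reading pattern, not a fix):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class LocalGzipRead {
    public static void main(String[] args) throws IOException {
        // create a small .gz file so the example is self-contained
        Path tmp = Files.createTempFile("demo", ".gz");
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(tmp)), StandardCharsets.UTF_8)) {
            w.write("first line\nsecond line\n");
        }
        int count = 0;
        // buffer the raw file bytes *before* gzip-decoding them,
        // and let try-with-resources close every stream in the chain
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new BufferedInputStream(Files.newInputStream(tmp))),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                count++; // long-running per-line work would go here
            }
        }
        System.out.println(count);
        Files.delete(tmp);
    }
}
```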

I can unzip the files without any problems. I also tried GzipCompressorInputStream from Apache Commons Compress with the same result.
What is the problem here and how can it be solved?

update 1

I thought I had ruled this out, but doing more tests, I found that the problem is restricted to streaming files from the internet.

full example:

URL source = new URL(url);
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept-Encoding", "gzip, deflate");
BufferedReader br = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(connection.getInputStream())));
String line;
while ((line = br.readLine()) != null) { // exception is thrown here
    Thread.sleep(5);
}

Interestingly, when I printed the line numbers, I found that the crash always happens at one of the same four or five lines.
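To distinguish a time-based disconnect (crash after a roughly constant duration, suggesting a timeout) from a position-based one (crash at a roughly constant line, suggesting a data or protocol problem), line count and elapsed time can be logged together. This is only a diagnostic sketch; a StringReader stands in for the network stream so it runs offline:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class TimedLineReader {
    // Reads all lines and returns {lineCount, elapsedMillis}. On failure the
    // caller can log how long the stream survived: a near-constant duration
    // across runs points at a timeout, a near-constant line at a data issue.
    static long[] readAll(Reader source) throws IOException {
        long start = System.nanoTime();
        long lines = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            while (br.readLine() != null) {
                lines++;
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new long[] {lines, elapsedMs};
    }

    public static void main(String[] args) throws IOException {
        long[] result = readAll(new StringReader("a\nb\nc\n"));
        System.out.println(result[0] + " lines in " + result[1] + " ms");
    }
}
```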


update 2

Here is a full example using an actual file:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;


public class TestGZIPStreaming {

    public static void main(String[] args) throws IOException, InterruptedException {

        URL source = new URL("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
        HttpURLConnection connection = (HttpURLConnection) source.openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept-Encoding", "gzip, deflate");
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(connection.getInputStream())));

        String line;
        int n = 0;

        while ((line = br.readLine()) != null) { // exception is thrown here
            Thread.sleep(10);
            System.out.println(++n);
        }

    }

}

For this file the crashes appear around line 90000.

To rule out a timeout problem I tried connection.setReadTimeout(0) - with no effect.

It probably is a network issue. But since I can download the file in a browser, there has to be a way to deal with it.
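One workaround, assuming enough disk space for the compressed file: download at full speed to a temporary file first, then do the slow per-line work against the local copy, so the connection is never held open while the computation runs. A sketch, with the network stream replaced by an in-memory stand-in so the example is self-contained:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class DownloadThenProcess {
    // Copy the (possibly slow-to-consume) source to disk at full network
    // speed, then process the local file as slowly as needed.
    static int copyAndCount(InputStream networkStream) throws IOException {
        Path tmp = Files.createTempFile("download", ".gz");
        Files.copy(networkStream, tmp, StandardCopyOption.REPLACE_EXISTING);
        int lines = 0;
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new BufferedInputStream(Files.newInputStream(tmp))),
                StandardCharsets.UTF_8))) {
            while (br.readLine() != null) {
                lines++; // slow computation can take as long as it likes here
            }
        }
        Files.delete(tmp);
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // stand-in for connection.getInputStream(): a small in-memory .gz payload
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(buf), StandardCharsets.UTF_8)) {
            w.write("one\ntwo\nthree\n");
        }
        System.out.println(copyAndCount(new ByteArrayInputStream(buf.toByteArray())));
    }
}
```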

update 3

I tried connecting using Apache HttpClient.

HttpClient client = HttpClients.createDefault();
HttpGet get = new HttpGet("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
get.addHeader("Accept-Encoding", "gzip");
HttpResponse response = client.execute(get);
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(new BufferedInputStream(response.getEntity().getContent()))));

Now I'm getting the following exception, which is probably more helpful:

org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 3850131; received: 1581056)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)

Again, there has to be a way to handle the problem since I can download the file in a browser and decompress it without any problem.
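Since the ConnectionClosedException shows the body ending mid-transfer, one browser-like strategy is to resume from the last received byte offset after a disconnect. A real implementation would reopen the HttpURLConnection with a `Range: bytes=<offset>-` header and check for a 206 Partial Content response (this assumes the server supports range requests, which is not confirmed here). The retry logic alone can be sketched like this, with the network replaced by a simulated flaky source:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ResumingDownload {
    // Opens a stream starting at the given byte offset. An HTTP version
    // would send "Range: bytes=<offset>-" and reopen the connection.
    interface RangeSource {
        InputStream open(long offset) throws IOException;
    }

    // Reads the whole resource, reopening from the last good offset
    // whenever the connection dies mid-body.
    static byte[] readFully(RangeSource source, int maxRetries) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int retries = 0;
        while (true) {
            try (InputStream in = source.open(out.size())) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
                return out.toByteArray(); // clean end of stream
            } catch (IOException e) {
                if (++retries > maxRetries) throw e;
                // otherwise loop and resume from out.size()
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello, resumable world".getBytes();
        // simulated flaky source: the first attempt dies after a few bytes
        RangeSource flaky = new RangeSource() {
            boolean failed = false;
            public InputStream open(long offset) {
                InputStream base = new ByteArrayInputStream(
                        data, (int) offset, data.length - (int) offset);
                if (failed) return base;
                failed = true;
                return new InputStream() {
                    int served = 0;
                    public int read() throws IOException {
                        if (served++ >= 5) throw new IOException("connection reset");
                        return base.read();
                    }
                };
            }
        };
        System.out.println(new String(readFully(flaky, 3)));
    }
}
```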

analina
  • is this reproducible on other machines? – wero Nov 27 '15 at 09:27
  • You should post all relevant code. A [BufferedReader](http://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html#constructor.summary) doesn't take an `InputStream` as constructor parameter. – SubOptimal Nov 27 '15 at 09:52
  • Just an aside, I would put the buffering between the file input stream and the gzip input stream. Also, this code snippet is probably not representative enough to find the source of the problem. Ideally we need a full working example that produces the unwanted result. – biziclop Nov 27 '15 at 10:15
  • @SubOptimal I forgot the InputStreamReader. It's corrected now. – analina Nov 27 '15 at 10:26
  • @analina Could you please have a look if you might be hit by this bugreport [JDK-6907252](https://bugs.openjdk.java.net/browse/JDK-6907252). From the posted code I would not think so, but your described symptoms are quite similar. – SubOptimal Nov 27 '15 at 15:28
  • @biziclop I added a full example – analina Nov 30 '15 at 16:05
  • @analina With your update what I suspect is happening is that the underlying network connection is closed after a certain period of inactivity. Maybe you can print out the time stamp for each line so that you can see how long it takes for the process to crash. If it's suspiciously close to a round number, like 60 seconds, it's probably some kind of timeout. – biziclop Nov 30 '15 at 16:16
  • @wero I tried it on another machine - with the same result. – analina Dec 12 '15 at 18:35
  • @biziclop I measured the time until the crashes happen, but couldn't find a consistent pattern. There was little but significant variance between the time spans, and it took longer for bigger sleep times. Is there a way to deal with/prevent the connection being closed? – analina Dec 12 '15 at 19:08

0 Answers