I need to process large gzip-compressed text files.
InputStream is = new GZIPInputStream(new FileInputStream(path));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
someComputation();
}
This code works as long as I don't do any long computation inside the loop, but I have to. Adding even a few milliseconds of sleep per line eventually makes the program crash with a java.util.zip.ZipException. The exception's message is different every time ("invalid literal/length code", "invalid block type", "invalid stored block lengths").
So, it seems that the stream becomes corrupted when I'm not reading it quickly enough.
I can unzip the files without any problems. I also tried GzipCompressorInputStream from Apache Commons Compress with the same result.
What is the problem here and how can it be solved?
update 1
I thought I had ruled this out, but after running more tests I found that the problem is restricted to files streamed from the internet.
full example:
URL source = new URL(url);
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "gzip, deflate");
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
String line;
while ((line = br.readLine()) != null) { //exception is thrown here
Thread.sleep(5);
}
Interestingly, when I printed the line numbers, I found that the crash always happens at one of the same four or five lines.
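One thing that might absorb short processing pauses is a much larger intermediate buffer between the socket and the decompressor, so a sleeping reader drains the buffer instead of stalling the TCP connection. A minimal sketch (the buffer sizes are guesses on my part, not something I have verified against this server):

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class BufferedGzipReader {
    // Put a large BufferedInputStream between the network stream and the
    // decompressor so short per-line pauses are served from memory rather
    // than stalling the connection. Sizes are guesses, not tuned values.
    static BufferedReader open(InputStream network) throws IOException {
        InputStream buffered = new BufferedInputStream(network, 1 << 20); // 1 MiB read-ahead
        InputStream gunzip = new GZIPInputStream(buffered, 64 * 1024);    // 64 KiB inflate buffer
        return new BufferedReader(new InputStreamReader(gunzip, StandardCharsets.UTF_8));
    }
}
```

This can only paper over short stalls, of course; it can't help once the total processing time exceeds whatever limit the server enforces on the connection.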
update 2
Here is a full example containing an actual file:
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;
public class TestGZIPStreaming {
public static void main(String[] args) throws IOException, InterruptedException {
URL source = new URL("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "gzip, deflate");
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
String line;
int n = 0;
while ((line = br.readLine()) != null) { //exception is thrown here
Thread.sleep(10);
System.out.println(++n);
}
}
}
For this file the crashes appear around line 90000.
To rule out a timeout problem, I tried connection.setReadTimeout(0), with no effect.
It is probably a network issue. But since I can download the file in a browser, there has to be a way to deal with it.
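Since the download itself runs at full speed, one workaround might be to decouple downloading from processing entirely: copy the whole .gz file to a temp file first, then decompress and iterate over it locally. A minimal sketch (the helper names are mine, not from any library):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

public class DownloadThenProcess {
    // Drain the network stream to disk at full speed; nothing slow happens
    // while the connection is open.
    static Path downloadToTempFile(InputStream network) throws IOException {
        Path tmp = Files.createTempFile("dump", ".gz");
        Files.copy(network, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Decompress the local copy; per-line processing can now take as long
    // as it needs, since no server is waiting on us.
    static BufferedReader openGzipFile(Path gzFile) throws IOException {
        return new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8));
    }
}
```

This costs disk space for the compressed file, but the slow computation is completely isolated from the network.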
update 3
I tried connecting using Apache HttpClient.
HttpClient client = HttpClients.createDefault();
HttpGet get = new HttpGet("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
get.addHeader("Accept-Encoding", "gzip");
HttpResponse response = client.execute(get);
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(new BufferedInputStream(response.getEntity().getContent()))));
Now I'm getting the following exception, which is probably more helpful:
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 3850131; received: 1581056)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
Again, there has to be a way to handle the problem since I can download the file in a browser and decompress it without any problem.
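If the server really is closing the Content-Length delimited response early, another option might be to resume the transfer with an HTTP Range request instead of restarting from scratch. A hedged sketch (class and method names are hypothetical; it assumes the server honors Range requests, and a robust version would check for a 206 Partial Content status before appending):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ResumingDownload {
    // On a premature close, re-request only the missing bytes with a Range
    // header and append them to the partial file on disk. Assumes the
    // server supports Range; a robust version would verify the response
    // status is 206 Partial Content before appending.
    static void download(String url, Path target, int maxRetries) throws IOException {
        for (int attempt = 0; ; attempt++) {
            long have = Files.exists(target) ? Files.size(target) : 0;
            URLConnection conn = new URL(url).openConnection();
            if (have > 0) {
                conn.setRequestProperty("Range", "bytes=" + have + "-");
            }
            try (InputStream in = conn.getInputStream();
                 OutputStream out = Files.newOutputStream(target,
                         StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
                return; // stream ended normally, download complete
            } catch (IOException e) {
                if (attempt >= maxRetries) throw e;
                // otherwise retry, resuming from the bytes already on disk
            }
        }
    }
}
```

Once the file is complete on disk, it can be decompressed and processed locally at whatever pace the computation requires.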