GZIPInputStream end-of-file sequence in BufferedReader

Question

I use a Java BufferedReader object to read, line by line, a GZIPInputStream that points to a valid GZIP archive containing 1,000 lines of ASCII text in typical CSV format. The code looks like this:

BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(new FileInputStream(file))));

where file is the actual File object pointing to the archive.

I read through the whole file by calling

int count = 0;
String line = null;

while ((line = reader.readLine()) != null)
{
    count++;
}

and the reader goes over the file as expected, but at the end it goes past line #1000 and reads one more line (i.e., count = 1001 after the loop ends).

Calling line.length() on the last line reports a large number (4,000+) of characters, all of which are non-printable (Character.getNumericValue() returns -1).

Actually, if I do line.getBytes() the resulting byte[] array has an equal number of NULL characters ('\0').

Does this seem like a bug in BufferedReader?

In any case, can anyone please suggest a workaround to bypass this behavior?

EDIT: More weird behavior: The first line read is prefixed by the filename, several NULL characters ('\0') and things like username and group name, then the actual text follows!

EDIT: I have created a very simple test class that reproduces the effect I described above, at least on my platform.

EDIT: Apparently false alarm, the file I was getting was not plain GZIP but tarred GZIP, so this explains it, no need for further testing. Thanks everyone!

Ways to debug: extract the file with external `gzip`, leave out the `GZIPInputStream`, and look at the extracted file. It could be that your gzip file is faulty, or that InputStreamReader, BufferedReader or GZIPInputStream has a bug. — Paŭlo Ebermann, Jun 28 '11 at 12:43
You are speaking of a plain gzip file, but your code refers to a `tar.gz` file. Why? Are you aware that tarring and then gzipping a file is not the same as just gzipping it? — leonbloy, Jun 28 '11 at 20:43
Since you got an answer to your problem, it would be a good idea to accept one of the answers, vote up the answers that helped you, etc. — woliveirajr, Jun 29 '11 at 14:03
Off-topic, I must say I really cannot see how a valid question should ever be down-voted. True, there was no "bug" in the first place as the OP did not at first see that he was trying to parse GZIP files and had in fact been inadvertently sending in tarballs. However, the question was legitimate and detailed, received a valid and thorough answer, the OP did see his error and he was polite enough to discard his bug report on SourceForge. So....an upvote to both OP and the answer is the least I can do.... EDIT: a note to OP - please, do accept the answer if it proved useful... — quantum, Jul 24 '11 at 16:30

score 3 · Answer 1 · edited May 23 '17 at 12:19

I think I found your problem.

I tried to reproduce it with your source in the question, and got this output:

-------------------------------------
        Reading PLAIN file
-------------------------------------

Printable part of line 1:       This, is, line, number, 1

Line start (<= 25 characters): This__is__line__number__1

No NULL characters in line 1

Other information on line 1:
        Length: 25
        Bytes: 25
        First byte: 84

Printable part of line 10:      This, is, line, number, 10

Line start (<= 26 characters): This__is__line__number__10

No NULL characters in line 10

Other information on line 10:
        Length: 26
        Bytes: 26
        First byte: 84

File lines read: 10

-------------------------------------
        Reading GZIP file
-------------------------------------

Printable part of line 1:       This, is, line, number, 1

Line start (<= 25 characters): This__is__line__number__1

No NULL characters in line 1

Other information on line 1:
        Length: 25
        Bytes: 25
        First byte: 84

Printable part of line 10:      This, is, line, number, 10

Line start (<= 26 characters): This__is__line__number__10

No NULL characters in line 10

Other information on line 10:
        Length: 26
        Bytes: 26
        First byte: 84

File lines read: 10

-------------------------------------
        TOTAL READ
-------------------------------------

Plain: 10, GZIP: 10

I don't think this is what you are seeing. Why? You are using a tar.gz file: the tar archive format with gzip compression applied on top. GZIPInputStream undoes the gzip compression, but knows nothing about the tar archive format.

tar is normally used to pack multiple files together - in an uncompressed format, but together with some metadata, which is what you observe:

EDIT: More weird behavior: The first line read is prefixed by the filename, several NULL characters ('\0') and things like username and group name, then the actual text follows!
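
One way to confirm that the metadata described above really is a tar header is to look at the first 512-byte block of the decompressed stream. The following is only a sketch: it assumes a ustar-style archive (POSIX and GNU tar both put the ASCII string "ustar" at offset 257 of each header; very old tar files have no such marker), and `TarSniffer`, `looksLikeTar` and `gzip` are hypothetical names made up for this illustration. It needs Java 11+ for `readNBytes(int)`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class TarSniffer {

    /** Returns true if the gunzipped data starts with a ustar-style tar header. */
    static boolean looksLikeTar(InputStream gzipped) throws IOException {
        try (InputStream in = new GZIPInputStream(gzipped)) {
            byte[] header = in.readNBytes(512);      // one tar block (Java 11+)
            if (header.length < 512) {
                return false;                        // too short to hold a tar header
            }
            // Assumption: ustar/GNU header, magic "ustar" at offset 257.
            String magic = new String(header, 257, 5, StandardCharsets.US_ASCII);
            return magic.equals("ustar");
        }
    }

    /** Demo helper: gzip a byte array in memory. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Plain gzipped text: no tar header.
        byte[] plain = "This, is, line, number, 1\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeTar(new ByteArrayInputStream(gzip(plain))));

        // A fake 512-byte block carrying only the ustar magic, for illustration.
        byte[] fakeHeader = new byte[512];
        System.arraycopy("ustar".getBytes(StandardCharsets.US_ASCII), 0, fakeHeader, 257, 5);
        System.out.println(looksLikeTar(new ByteArrayInputStream(gzip(fakeHeader))));
    }
}
```

For a real file you would pass `new FileInputStream(file)` instead of the in-memory streams used in `main`.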

If you have a tar file, you need to use a tar decoder. The question "How do I extract a tar file in Java?" gives some links (such as using the Tar task from Ant); there is also JTar.
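
If pulling in a library is not an option and the archive holds a single small text file, the tar framing can also be skipped by hand. This is a rough sketch, not a full tar decoder: it assumes the classic header layout (512-byte blocks, entry size as octal ASCII at offset 124, type flag at offset 156) and ignores GNU long names, pax extended headers and base-256 sizes. `FirstTarEntryReader` and its methods are hypothetical names; Java 11+ is assumed for `readNBytes(int)`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

class FirstTarEntryReader {

    /** Reads the lines of the first regular-file entry in a tar.gz stream. */
    static List<String> readFirstEntryLines(InputStream gzipped) throws IOException {
        List<String> lines = new ArrayList<>();
        try (InputStream in = new GZIPInputStream(gzipped)) {
            while (true) {
                byte[] header = in.readNBytes(512);
                // A short or zero-filled block is treated as the end of the archive.
                if (header.length < 512 || header[0] == 0) {
                    return lines;
                }
                long size = parseOctal(header, 124, 12);
                byte type = header[156];
                if (type == '0' || type == 0) {           // regular file
                    // Assumption: the entry is small enough to hold in memory.
                    byte[] data = in.readNBytes((int) size);
                    BufferedReader reader = new BufferedReader(new InputStreamReader(
                            new ByteArrayInputStream(data), StandardCharsets.US_ASCII));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        lines.add(line);
                    }
                    return lines;
                }
                // Not a regular file: skip its data, padded to a multiple of 512.
                skipFully(in, (size + 511) / 512 * 512);
            }
        }
    }

    /** Parses an octal ASCII field that may be padded with spaces or NULs. */
    static long parseOctal(byte[] buf, int off, int len) {
        int i = off;
        int end = off + len;
        while (i < end && buf[i] == ' ') {
            i++;
        }
        long value = 0;
        while (i < end && buf[i] >= '0' && buf[i] <= '7') {
            value = value * 8 + (buf[i] - '0');
            i++;
        }
        return value;
    }

    /** Discards exactly n bytes, failing if the stream ends early. */
    static void skipFully(InputStream in, long n) throws IOException {
        byte[] buf = new byte[8192];
        while (n > 0) {
            int read = in.read(buf, 0, (int) Math.min(buf.length, n));
            if (read < 0) {
                throw new EOFException("truncated tar entry");
            }
            n -= read;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a tar.gz file
        try (InputStream file = new FileInputStream(args[0])) {
            System.out.println("File lines read: " + readFirstEntryLines(file).size());
        }
    }
}
```

For anything beyond this simple case (several entries, long file names, large files), a real tar library is the safer choice.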

If you want to send only one file, better use the gzip format directly (this was what I did in my test).
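
For comparison, here is roughly what the plain-gzip round trip looks like. It is a sketch rather than the exact test class I ran: `PlainGzipRoundTrip`, `writeLines` and `countLines` are names invented here, and the line text simply mimics the output shown above. The reading side is the same stream chain as in the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class PlainGzipRoundTrip {

    /** Writes n CSV-like lines into a plain gzip file (no tar involved). */
    static void writeLines(File file, int n) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(file)), StandardCharsets.US_ASCII))) {
            for (int i = 1; i <= n; i++) {
                writer.write("This, is, line, number, " + i);
                writer.newLine();
            }
        }
    }

    /** Counts lines using the same reader chain as in the question. */
    static int countLines(File file) throws IOException {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(file)), StandardCharsets.US_ASCII))) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("lines", ".gz");
        tmp.deleteOnExit();
        writeLines(tmp, 1000);
        System.out.println("File lines read: " + countLines(tmp));
    }
}
```

With a file produced this way the count matches the number of lines written, with no extra trailing line of NUL characters.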

But there is no bug anywhere, apart from you expecting the gzip-stream to read the tar format.

Sure. I could not believe this was a bug, which is why I asked for help. Thanks again! — PNS, Jun 28 '11 at 20:54
This could have been solved right at the start if you had somehow shown your file. (I added some Java tar links to the answer.) — Paŭlo Ebermann, Jun 28 '11 at 20:58
