Determining GZIPOutputStream behavior

Question

The following code produces files that are deterministic (the shasum is the same) for two strings.

    try(
            FileOutputStream fos = new FileOutputStream(saveLocation);
            GZIPOutputStream zip = new GZIPOutputStream(fos, GZIP_BUFFER_SIZE);
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(zip, StandardCharsets.UTF_8));
            ){
        writer.append(str);
    }

Produces:

a.gz f0200d53f7f9b35647b5dece0146d72cd1c17949

However, if I take the file on the command line and re-zip it, it produces a different result:

> gunzip -n a.gz ;gzip -n a ; shasum a.gz 

50f478a9ceb292a2d14f1460d7c584b7a856e4d9  a.gz

How can I get it to match the original SHA using /usr/bin/gzip and gunzip?

You would have to match [compression level](https://stackoverflow.com/q/19138179/2970947) and you might need to match the buffer size (I'm not 100% certain on that second point). — Elliott Frisch, Feb 15 '20 at 19:39
Try adding `-1` or `-9` to `gzip` command, and see if that changes anything. — Andreas, Feb 15 '20 at 19:40
I checked the compression level, but the result doesn't match at any level. The file size matches fine. — ergonaut, Feb 15 '20 at 19:42
@ergonaut please provide the rest of the code (e.g. where `str` comes from). — syntagma, Feb 15 '20 at 19:54
@ergonaut What is the output of `file a.gz` before extracting and after creating the `.gz` file again? — Progman, Feb 15 '20 at 19:56
Does the file you create with Java uncompress with gunzip? If so, what's the problem? — NomadMaker, Feb 16 '20 at 00:30
@NomadMaker - The problem is that he needs the SHA sums to be reproducible. For some reason. For example, maybe his application is using this to check for uniqueness of compressed files coming from different sources to check for tampering, or for de-duping. This is a reasonable requirement, IMO. — Stephen C, Feb 16 '20 at 01:27
Yes, but he shouldn't need them to be identical to gzip's. If he wanted that, he could look at the gzip source code. — NomadMaker, Feb 16 '20 at 03:06

score 1 · Answer 1 · edited Oct 07 '21 at 11:32

I think that the problem is likely to be the Gzip file header.

The Gzip format has provision for including a file name and file timestamp in the file header. (I see you are using `-n` when uncompressing and recompressing, which is probably correct here.)
The Gzip format also includes an "operating system id" in the header. This is supposed to identify the source file system type; e.g. 0 for FAT, 3 for Unix, and so on.

Either of these could lead to differences in the Gzip files and hence different hashes.

If I were going to solve this myself, I would start by using `cmp` to see where the compressed files start to differ, and then `od` to identify what the differences are. Refer to the Gzip file format spec to figure out what the differences mean:

RFC 1952 - GZIP file format specification version 4.3
Wikipedia's gzip page.
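
For anyone who would rather do the `cmp` step from Java, here is a minimal sketch. The class and method names (`GzipDiff`, `firstDifference`, `fieldAt`) are made up for illustration; the offsets come from the fixed 10-byte header layout in RFC 1952, and `Arrays.mismatch` needs Java 9 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of any library: locates the first differing
// byte of two gzip files and names the header field it falls in.
class GzipDiff {
    /** Offset of the first differing byte, or -1 if the arrays are equal (like cmp). */
    static int firstDifference(byte[] a, byte[] b) {
        return Arrays.mismatch(a, b);
    }

    /** Names the fixed gzip header field (RFC 1952) that contains the given offset. */
    static String fieldAt(int offset) {
        if (offset < 0) return "none (files are identical)";
        if (offset < 2) return "ID1/ID2 (magic number)";
        if (offset == 2) return "CM (compression method)";
        if (offset == 3) return "FLG (flags)";
        if (offset < 8) return "MTIME (timestamp)";
        if (offset == 8) return "XFL (extra flags)";
        if (offset == 9) return "OS (operating system id)";
        return "beyond the fixed header (optional fields, deflate data, or trailer)";
    }

    public static void main(String[] args) throws IOException {
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));
        int at = firstDifference(a, b);
        System.out.println("first difference at offset " + at + ": " + fieldAt(at));
    }
}
```

Run it with the two `.gz` files as arguments. A first difference at offset 9 would point at the OS id; a first difference past offset 9 would suggest the compressed data itself differs.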

How can I get it to match the original SHA using gzip and gunzip?

Assuming that the difference is the OS id, I don't think there is a practical way to solve this with the gzip and gunzip commands.

I looked at the source code for GZIPOutputStream in Java 11, and it is not promising.

It is hard-wiring the timestamp to zero.
It is hard-wiring the OS identifier to zero (which is supposed to mean FAT).
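
You can check what your own JVM writes without reading the JDK source. This is only a sketch: `GzipHeaderCheck` and its methods are illustrative names, and the OS value it prints may not be the one I saw in Java 11 if you run a different JDK version.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: compresses a string in memory and exposes the
// MTIME and OS fields of the gzip header that GZIPOutputStream wrote.
class GzipHeaderCheck {
    /** Compresses text with GZIPOutputStream and returns the raw gzip bytes. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(out)) {
            zip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /** Reads the little-endian MTIME field stored at offsets 4..7. */
    static long mtime(byte[] gz) {
        return (gz[4] & 0xffL)
                | (gz[5] & 0xffL) << 8
                | (gz[6] & 0xffL) << 16
                | (gz[7] & 0xffL) << 24;
    }

    /** Reads the OS id stored at offset 9. */
    static int osId(byte[] gz) {
        return gz[9] & 0xff;
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = gzip("hello");
        System.out.println("MTIME=" + mtime(gz) + " OS=" + osId(gz));
    }
}
```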

The hard-wiring is in a private method and would be next to impossible to "fix" by subclassing or reflection. You could copy the code and fix it that way, but then you have to maintain your variant GZIPOutputStream class indefinitely.
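
A different route from copying the class, offered as an untested idea rather than a recommendation: patch the OS byte after compression. The gzip trailer CRC-32 is computed over the uncompressed data, and as far as I can tell `GZIPOutputStream` does not write the optional header CRC, so changing offset 9 should leave the stream valid. `GzipOsPatch` and its methods are hypothetical names, and `readAllBytes` needs Java 9 or later. Note that this only aligns one header byte; the XFL byte and the deflate data produced by `java.util.zip` may still differ from what `/usr/bin/gzip` writes, in which case the hashes still won't match.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: rewrites the OS id in an already-compressed gzip stream.
class GzipOsPatch {
    static final int OS_OFFSET = 9; // position of the OS byte in the fixed header (RFC 1952)
    static final int OS_UNIX = 3;   // the value RFC 1952 assigns to Unix

    /** Returns a copy of the gzip bytes with the header OS id replaced. */
    static byte[] withOsId(byte[] gz, int osId) {
        if (gz.length < 10 || (gz[0] & 0xff) != 0x1f || (gz[1] & 0xff) != 0x8b) {
            throw new IllegalArgumentException("not a gzip stream");
        }
        byte[] patched = gz.clone();
        patched[OS_OFFSET] = (byte) osId;
        return patched;
    }

    /** Compresses text with GZIPOutputStream and returns the raw gzip bytes. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(out)) {
            zip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /** Decompresses gzip bytes back to a string, to confirm the stream is still readable. */
    static String gunzip(byte[] gz) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] patched = withOsId(gzip("hello"), OS_UNIX);
        System.out.println("OS=" + (patched[OS_OFFSET] & 0xff) + " text=" + gunzip(patched));
    }
}
```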

(I would be looking at changing the application, or whatever depends on this, so that I didn't need the checksums to be identical. You haven't said why you are doing this. If it is for testing purposes only, try looking for a different way to implement the tests.)
