I'm trying to compare the output of compressing strings with GZIP in Java against the output of the CLI gzip command. The outputs are not the same, and I've figured out why, but I'm not sure how to get them to match.

I've read a number of questions on SO, the man pages for gzip, and the code for both GZIPOutputStream and DeflaterOutputStream. The default compression level for GZIPOutputStream (set through Deflater) is -1, and there's little explanation of what that means. Furthermore, the gzip CLI only allows values between 1 and 9, inclusive.

So is there a way I can change the compression settings in either Java or the gzip command to make them produce the same output?

Spencer Kormos
  • Have you tried changing from the default compression level in Java to another like this? OutputStream gzipout = new GZIPOutputStream(bos){{def.setLevel(Deflater.BEST_COMPRESSION);}}; to maybe match gzip --best (or -9) – Mark Setchell Feb 28 '14 at 14:57
  • @MarkSetchell It's an interesting thought, but this will set the level after the constructor has completed its other operations, which at that point has already set the default compression, written the header, etc. – Spencer Kormos Feb 28 '14 at 15:53

1 Answer

No. Java is using the zlib deflator, which is derived from but not exactly the same as the older gzip command-line utility deflator. They will generally not produce the same output and there are no settings to coerce them to do so.

Compression level -1 requests the default compression level, which in the current zlib implementation is level 6.
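To make the meaning of -1 concrete, here is a minimal sketch that uses java.util.zip.Deflater directly (so the output is a zlib-wrapped stream, not a gzip file). The class name DefaultLevelDemo and the helper deflate are made up for illustration. It compresses the same input once with Deflater.DEFAULT_COMPRESSION and once with an explicit level 6, then reports whether the bytes are equal; on a JDK backed by a zlib that maps the default to 6, they should be.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

public class DefaultLevelDemo {
    // Hypothetical helper: compress input with Deflater at the given level.
    // The result is a zlib-wrapped deflate stream, not a gzip file.
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("DEFAULT_COMPRESSION = " + Deflater.DEFAULT_COMPRESSION);
        byte[] byDefault = deflate(data, Deflater.DEFAULT_COMPRESSION);
        byte[] bySix = deflate(data, 6);
        System.out.println("same bytes as level 6: " + Arrays.equals(byDefault, bySix));
    }
}
```

This only compares Java against itself. It says nothing about whether the gzip command at -6 produces the same bytes.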

I would have to ask why you care to get their outputs to be the same. All that matters is that the compression is lossless, i.e., that both the gzip and Java compressed streams produce the same original data when decompressed. There is no requirement that, for example, different versions of zlib produce the same output at the same compression level.
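A round-trip check along those lines could look roughly like the sketch below. The class name GzipRoundTrip and the helpers gzip and gunzip are hypothetical; only GZIPOutputStream and GZIPInputStream come from the JDK. The check compares the decompressed bytes with the original input and never looks at the compressed bytes themselves.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Hypothetical helper: gzip-compress a byte array at the default level.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Hypothetical helper: decompress a gzip stream back to the original bytes.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "some string to compress".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        // Compare the decompressed data, not the compressed bytes.
        System.out.println("round trip ok: " + Arrays.equals(original, restored));
    }
}
```

The same gunzip helper can be fed bytes read from a file written by the gzip command, since both sides use the gzip file format even when the compressed bytes differ.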

Mark Adler
  • This is what I had thought as well, but I needed my suspicions confirmed. It was a requirement from "the powers that be", and I needed a definitive reason beyond "it makes better sense to just do a round-trip test", as that didn't seem to be convincing enough. So the output being different from version to version definitely invalidates any encoding comparisons, and saves everyone a lot of headaches later. Thanks for the prompt and clear answer. – Spencer Kormos Feb 28 '14 at 15:59