
The following code (on Node.js v0.10.28):

var zlib = require('zlib');
var buf = new Buffer('uncompressed');

zlib.gzip(buf, function (err, result) {
  console.log(result.toString('base64'));
});

produces the following strings:

on Win 7 x64:

H4sIAAAAAAAACyvNS87PLShKLS5OTQEA3a5CsQwAAAA=

            ^                              ^

on Mac

H4sIAAAAAAAAAyvNS87PLShKLS5OTQEA3a5CsQwAAAA

            ^                              ^

on CentOs (Linux 2.6.32-279.19.1.el6.x86_64)

H4sIAAAAAAAAAyvNS87PLShKLS5OTQEA3a5CsQwAAAA=

            ^                              ^

It seems that they differ in the trailing `=` and the 13th character (C vs A), but I'm not sure why.

Mrchief
  • What is `zlib`? Is it `require('zlib')`? What node version is this? – mscdex Oct 22 '14 at 20:38
  • Does it decompress properly between environments? For example, if you compress on Windows and decompress on Mac, is it correct? – Joe Enos Oct 22 '14 at 21:38
  • @JoeEnos: Haven't tried that yet. I'll let you know tomorrow. – Mrchief Oct 23 '14 at 00:54
  • @JoeEnos: They decompress fine. And based on Cameron's excellent answer, it makes sense (the OS byte is ignored as it's just meta info). – Mrchief Oct 23 '14 at 14:29
  • Makes sense. But that OS-byte seems pretty worthless, if it's just metadata that has zero impact on the output. Maybe it's just me, but if I'm writing a compression algorithm whose purpose is to make data smaller, I'm not going to waste a byte like that. Imagine how many trillions of gzip files are out there in the world, each with an unnecessary byte - those really add up. Won't someone please think of the children? :) – Joe Enos Oct 23 '14 at 14:52
  • It's not worthless. That byte tells you how to handle OS-specific stuff, e.g. line-endings. In my test, I'm using a plain string. If I was using a file, things would look different. And if you're worried about byte savings, you can switch to _deflate_, which will avoid that extra header (see the sketch after these comments), with the risk of your payload occasionally not working in some browsers/clients. – Mrchief Oct 23 '14 at 15:33
  • 1
    To me, line-endings are not something a compression algorithm should worry about - CR and LF are just bytes, just like any other byte, so CRLF shouldn't compress any different from any other common pair of bytes, and CR or LF alone would be no different from "E" or any other single byte. And I would expect that the raw bytes of the pre-compressed input and the post-decompressed output should always be identical, regardless of the convention of the OS you're currently running. But I guess there must be scenarios where that isn't true, otherwise this wouldn't have been done. – Joe Enos Oct 23 '14 at 16:24
  • But it does. The _raw_ bytes on Windows are different from those on *nixes. Line-endings are very much a part of the data (file contents). Just because you don't see them doesn't mean they're not there. :) – Mrchief Oct 23 '14 at 17:25
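For the deflate-vs-gzip point raised in the comments, here is a minimal sketch using the same Node zlib API as in the question. The exact byte counts aren't guaranteed, but gzip wraps the same kind of deflate stream in a 10-byte header and 8-byte trailer, versus zlib/deflate's 2-byte header and 4-byte Adler-32 checksum, so the gzip output is typically about 12 bytes longer:

var zlib = require('zlib');
var buf = new Buffer('uncompressed');

// zlib 'deflate' output: 2-byte header + deflate stream + 4-byte Adler-32 checksum.
// gzip output: 10-byte header (including the OS byte mentioned above) + deflate stream + 8-byte trailer.
zlib.deflate(buf, function (err, deflated) {
  zlib.gzip(buf, function (err, gzipped) {
    console.log('deflate:', deflated.length, 'bytes; gzip:', gzipped.length, 'bytes');
  });
});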

1 Answer


Some theoretical sources of differences in general

gzip (RFC 1952), which uses deflate (RFC 1951) as its compression format, is technically only a file format specification. In particular, algorithms are given significant latitude in how they choose to compress the bytes they're given (this is actually a strength).

There are two basic compression mechanisms that can be used with deflate:

  • Length-limited Huffman coding: Characters that appear more frequently can be given a shorter bit-sequence, and less frequent characters can be given a longer bit-sequence, leading to fewer bits overall to represent the same information. The Huffman tree used for encoding can be calculated dynamically from the input (or part of it), or it can be fixed. As a result, different encoders may use different Huffman trees for the same input, leading to different representations of the tree itself and of the encoded characters.

  • LZ77 compression: Substrings that have already been output need not be output again; instead, only a backreference (the distance back and the length of the identical substring) need be output. Since finding all common substrings in a given input is a Hard Problem™, it's often more efficient just to find as many as possible with a given heuristic (e.g. tracking the last 6 substrings that started with each two-character prefix). Again, different encoders can (validly) produce different output for the same input.

Finally, all of this compressed data is spread out into one or more blocks, and it's at the encoder's discretion when to switch to a new block. In theory, this could even be done every byte (though that wouldn't really be compression!). When ending a block, because its contents are encoded using the Huffman bit codes, it's possible that the block doesn't end on a byte boundary; in such a case, arbitrary bits can be added as padding to round to the next byte if the subsequent item in the stream must start on a whole byte (uncompressed blocks, for example, have to start on byte boundaries).
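For a rough feel of the LZ77 mechanism described above, here is a small sketch; the exact compressed size is encoder- and level-dependent, which is rather the point:

var zlib = require('zlib');

// 500 bytes of highly repetitive input: after the first 'blah ', almost everything
// can be encoded as backreferences to text that was already output.
var repetitive = new Buffer(new Array(101).join('blah '));

zlib.deflateRaw(repetitive, function (err, compressed) {
  // Prints something like "500 -> 15"; the exact number varies by encoder and settings.
  console.log(repetitive.length, '->', compressed.length);
});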

So as you can see, there are many ways that the compressed bytes for an identical input may differ! Even with the same algorithm (e.g. the canonical zlib library, not to be confused with the RFC (1950) of the same name), different compression levels generally lead to different results. It's even conceivable that the same program run multiple times in the same environment with the same input and options could yield a different result, e.g. due to data structures that order pointers or use pointers as hash values -- pointer values can change between executions. Also, multithreaded implementations by their nature tend to be non-deterministic. In short, you should not depend on the output being the same for a given input, unless the implementation you're using explicitly provides that guarantee. (Although most sane implementations strive for determinism, it's not technically required.)
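To make that concrete, here is a sketch that compresses the same input at two different levels. It uses gzip streams so the level option works even on older Node versions; for a tiny input the two outputs may coincide, but for most inputs they differ, and both always gunzip back to the original:

var zlib = require('zlib');

function gzipWithLevel(input, level, cb) {
  var gz = zlib.createGzip({ level: level });
  var chunks = [];
  gz.on('data', function (c) { chunks.push(c); });
  gz.on('end', function () { cb(Buffer.concat(chunks)); });
  gz.end(input);
}

var input = new Buffer(new Array(51).join('uncompressed '));

gzipWithLevel(input, 1, function (fastest) {
  gzipWithLevel(input, 9, function (best) {
    // Same input, same library, different settings -> (usually) different bytes.
    console.log(fastest.toString('base64'));
    console.log(best.toString('base64'));
  });
});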

Why your specific example Base64 strings differ

Setting aside the differences in trailing = signs for a minute, two of your three examples have the exact same representation. The two distinct representations differ by exactly one bit (C -> A) in the first part of the tenth byte (Base64 encodes triplets of bytes as quadruplets of Base64 characters, so the thirteenth Base64 character is the first six bits of the tenth byte).

A represents 0, and C represents 2 -- but remember that this is the high six bits of the byte, so it's really 0 and 8 plus the low two bits. Those low two bits are the high two bits of the next Base64 character, y: y represents 50, which is 110010 in binary, so the low two bits of the tenth byte are 0b11, or 3. Putting it together, the tenth byte is the one that differs, with its value being 11 in one case and 3 in the others.

A quick look at the gzip RFC reveals that the tenth byte indicates the operating system/filesystem on which the encoding was performed: Sure enough, 11 is defined as "NTFS filesystem (NT)", and 3 as "Unix". So the difference in this case is completely due to the operating system you performed the encoding on. (Note that the second dword of any gzip file is the timestamp, which was set to 0 (none available) in your examples, but could easily have differed wildly across all three trials, making the difference harder to spot.)
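You can verify this directly from the strings in the question (nothing assumed here beyond those strings and the Node zlib/Buffer APIs already used above):

var zlib = require('zlib');

var winB64  = 'H4sIAAAAAAAACyvNS87PLShKLS5OTQEA3a5CsQwAAAA=';
var unixB64 = 'H4sIAAAAAAAAAyvNS87PLShKLS5OTQEA3a5CsQwAAAA=';

// Byte offset 9 (the tenth byte) is the OS field of the gzip header.
console.log(new Buffer(winB64, 'base64')[9]);  // 11 -> "NTFS filesystem (NT)"
console.log(new Buffer(unixB64, 'base64')[9]); // 3  -> "Unix"

// Despite the differing header byte, both decompress to the same payload:
zlib.gunzip(new Buffer(winB64, 'base64'), function (err, out) {
  console.log(out.toString()); // 'uncompressed'
});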

As for the trailing =, that's just Base64's padding (as explained nicely on Wikipedia). Since Base64 takes groups of three bytes and encodes them as four characters, if the number of bytes encoded is not divisible by three, the minimum number of Base64 digits is used (treating bits past the end of input as zero): For a single byte, only two Base64 digits are needed; for two, only three are needed. The = signs are just added to round the number of Base64 digits up to a multiple of four; you'll note that this means the = signs are not really required in order to decode the Base64 string, since you know its length (but some Base64 decoders will reject the string if its length is not a multiple of 4). Hence, your second and third examples represent exactly the same byte values, but were produced by different Base64 encoders.
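The padding behaviour is easy to see in Node itself (a small sketch; note that Node's Base64 decoder happens to tolerate missing '=' padding, while stricter decoders may not):

// One input byte needs two Base64 digits, two bytes need three;
// the '=' signs only pad the length out to a multiple of four.
console.log(new Buffer([0x41]).toString('base64'));       // 'QQ=='
console.log(new Buffer([0x41, 0x42]).toString('base64')); // 'QUI='

// Decoding with or without the padding yields the same bytes in Node:
console.log(new Buffer('QUI=', 'base64')); // <Buffer 41 42>
console.log(new Buffer('QUI', 'base64'));  // <Buffer 41 42>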

There you have it! I think my answer is rather too verbose for what nearly boils down to a trick question with a one-sentence answer, but I couldn't resist explaining everything in detail :-)

Cameron
  • 1
    I knew about base64 padding (even though my question indicates to the contrary), but the tenth byte explanation was what I was looking for! Can't thank you enough for this verbose answer because that is exactly what I was looking forward to! If I could, I would upvote this a 1000 times! – Mrchief Oct 23 '14 at 14:19
  • @Mrchief: I'm glad it's appreciated! :-) – Cameron Oct 23 '14 at 14:20
  • :) Yes, "very highly" appreciated! I learnt something new too! Also, looking at the spec, the Macintosh identifier is 7, so I guess node's zlib is slightly out of compliance (only slightly, because Macs are ultimately descendants of Unix!) :) – Mrchief Oct 23 '14 at 14:29
  • 1
    @Mrchief: Actually no, since the RFC dates back to when "Macintosh" meant Mac OS (that was [not based on Unix](http://en.wikipedia.org/wiki/History_of_Mac_OS)). The modern Mac OS X has much more in common with Unix (in fact it *is* a Unix OS) than with its predecessor (at the sub-UI level, at least), and so it makes more sense to declare the OS as Unix than as Macintosh. – Cameron Oct 23 '14 at 14:34