Some theoretical sources of differences in general
gzip (RFC 1952), which uses deflate (RFC 1951) as its compression format, is technically only a file format specification. In particular, encoders are given significant latitude in how they choose to compress the bytes they're given (this is actually a strength).
There are two basic compression mechanisms that can be used with deflate:
- Length-limited Huffman coding: Characters that appear more frequently can be given a shorter bit sequence, and less frequent characters a longer one, leading to fewer bits overall to represent the same information. The Huffman tree used for encoding can be calculated dynamically from the input (or part of it), or it can be fixed. As a result, different encoders may use different Huffman trees for the same input, leading to different representations of the tree itself and of the encoded characters (see the first sketch after this list).
- LZ77 compression: Substrings that have already been output need not be output again; instead, only a backreference with the length of the identical substring need be output. Since finding all common substrings in a given input is a Hard Problem™, it's often more efficient to just find as many as possible with a given heuristic (e.g. tracking the last six substrings that started with each two-character prefix). Again, different encoders can (validly) produce different output for the same input (see the second sketch after this list).
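To make the Huffman point concrete, here's a minimal sketch in Python (a toy, not any real deflate implementation) that builds a prefix code from symbol frequencies. Note that ties between equal-frequency subtrees have to be broken somehow; breaking them differently yields a different but equally valid tree, which is exactly how two encoders can diverge on the same input.

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a Huffman code from symbol frequencies (toy sketch).

    Ties between equal-frequency subtrees are broken by insertion
    order here; a different tie-break gives a different, equally
    valid code -- one source of encoder-to-encoder variation.
    """
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix the two subtrees' codes with 0 and 1 respectively.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

print(huffman_code("abracadabra"))
# The frequent 'a' gets a short code; rare 'c'/'d' get longer ones.
```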
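And here's an equally simplified LZ77 matcher, again just a sketch: it tracks recent positions per two-byte prefix (the heuristic mentioned above) and greedily emits backreferences. Changing the heuristic -- say, how many candidate positions to search -- changes which matches are found and therefore the output, while remaining perfectly decodable.

```python
from collections import defaultdict

def lz77_tokens(data: bytes, max_candidates: int = 6, min_len: int = 3):
    """Greedy toy LZ77: emit literal bytes and (distance, length) pairs.

    Only the last `max_candidates` positions per two-byte prefix are
    searched; a different bound (or a smarter search) finds different
    matches, so two valid encoders can tokenize the same input differently.
    """
    positions = defaultdict(list)  # two-byte prefix -> recent positions
    tokens, i = [], 0
    while i < len(data):
        best_len = best_dist = 0
        for j in positions[data[i:i + 2]][-max_candidates:]:
            length = 0
            # Overlapping matches are fine: the decoder copies byte by byte.
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append(("ref", best_dist, best_len))
            step = best_len
        else:
            tokens.append(("lit", data[i]))
            step = 1
        for k in range(i, i + step):  # index the bytes we just consumed
            positions[data[k:k + 2]].append(k)
        i += step
    return tokens

print(lz77_tokens(b"hello hello hello"))
```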
Finally, all of this compressed data is spread out into one or more blocks, and it's at the encoder's discretion when to switch to a new block. In theory, this could even be done every byte (though that wouldn't really be compression!). When ending a block, because its contents are encoded using the Huffman bit codes, it's possible that the block doesn't end on a byte boundary; in such a case, arbitrary bits can be added as padding to round to the next byte if the subsequent item in the stream must start on a whole byte (uncompressed blocks, for example, have to start on byte boundaries).
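You can watch block boundaries being chosen differently with Python's zlib module (this produces a zlib (RFC 1950) wrapper rather than a gzip file, but the deflate stream inside is the same format): forcing a full flush mid-stream makes the encoder close the current block and start a new one, producing different bytes that still decompress identically.

```python
import zlib

data = b"hello world, hello world, hello world"

# Let the encoder choose its own block layout:
one_shot = zlib.compress(data)

# Force a block boundary after the first 10 bytes with a full flush:
co = zlib.compressobj()
chunked = co.compress(data[:10]) + co.flush(zlib.Z_FULL_FLUSH)
chunked += co.compress(data[10:]) + co.flush()

print(one_shot.hex())
print(chunked.hex())
# Different compressed bytes, same decompressed content:
assert zlib.decompress(one_shot) == zlib.decompress(chunked) == data
```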
So as you can see, there are many ways that the compressed bytes for an identical input may differ! Even with the same algorithm (e.g. the canonical zlib library, not to be confused with the RFC (1950) of the same name), different compression levels generally lead to different results. It's even conceivable that the same program run multiple times in the same environment with the same input and options could yield different results, e.g. due to data structures that order pointers or use pointers as hash values -- pointer values can change between executions. Also, multithreaded implementations by their nature tend to be non-deterministic. In short, you should not depend on the output being the same for a given input, unless the implementation you're using explicitly provides that guarantee. (Although most sane implementations strive for determinism, it's not technically required.)
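Here's a small demonstration of the compression-level point using Python's binding to that canonical zlib library (the exact byte counts will vary with the zlib version, which is rather the point):

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 50

for level in (1, 6, 9):
    out = zlib.compress(data, level)
    print(f"level {level}: {len(out)} bytes")

# Different levels generally yield different compressed bytes, yet
# every one of them decompresses back to the original input:
assert all(zlib.decompress(zlib.compress(data, lvl)) == data
           for lvl in (1, 6, 9))
```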
Why your specific example Base64 strings differ
Setting aside the differences in trailing = signs for a minute, two of your three examples have the exact same representation; the two distinct representations differ by exactly one bit (C vs. A) in the first part of the tenth byte. (Base64 encodes triplets of bytes as quadruplets of base-64 characters, so the thirteenth Base64 character holds the first six bits of the tenth byte.) A represents 0, and C represents 2 -- but remember that this is the high six bits of the byte, so it's really 0 and 8 plus the low two bits. Those low two bits are the high two bits of the next Base64 character, y: y represents 50, which is 110010 in binary, so the low two bits of the tenth byte are 0b11, or 3. Putting it together, the tenth byte is the one that differs, with its value being 11 from one implementation and 3 from the other.

A quick look at the gzip RFC reveals that the tenth byte indicates the operating system/filesystem on which the encoding was performed: sure enough, 11 is defined as "NTFS filesystem (NT)", and 3 as "Unix". So the difference in this case is completely due to the operating system you performed the encoding on. (Note that the second dword of any gzip file is the timestamp, which was set to 0 (none available) in your examples, but could easily have differed wildly across all three trials, making the difference harder to spot.)
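If you want to check that bit arithmetic yourself, here's a short Python snippet that reproduces it (the characters 'C' and 'y' are the ones discussed above):

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

c13, c14 = "C", "y"             # thirteenth and fourteenth Base64 characters
hi6 = ALPHABET.index(c13)       # 'C' -> 2: the high six bits of byte 10
lo2 = ALPHABET.index(c14) >> 4  # 'y' -> 50 = 0b110010; its top two bits
                                # are the low two bits of byte 10
byte10 = (hi6 << 2) | lo2
print(byte10)                   # 11 -> "NTFS filesystem (NT)" per RFC 1952

# With 'A' (value 0) in place of 'C', the same arithmetic gives:
print((ALPHABET.index("A") << 2) | lo2)  # 3 -> "Unix"
```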
As for the trailing =, that's just Base64's padding (as explained nicely on Wikipedia). Since Base64 takes groups of three bytes and encodes them as four characters, if the number of bytes encoded is not divisible by three, the minimum number of Base64 digits is used (treating bytes past the end of input as null bytes): for a single byte, only two Base64 digits are needed; for two bytes, only three are needed. The = signs are just added to round the number of Base64 digits up to a multiple of four; you'll note that this means the = signs are not really required in order to decode the Base64 string, since you know its length (but some Base64 decoders will reject the string if its length is not a multiple of four). Hence, your second and third examples represent exactly the same byte values, but were produced by different Base64 encoders.
There you have it! I think my answer is rather too verbose for what nearly boils down to a trick question with a one-sentence answer, but I couldn't resist explaining everything in detail :-)