I have a large file that I need to compress, but I need to ensure that the compressed file has the same hash value as the original. I tried it on a smaller file and the hash values are different, but I think this might be because of a metadata change. How do I ensure that the file doesn't change after compression?
First of all, if you hash the original uncompressed file and then hash the compressed file, then yes, those will have different hash values. On the other hand, if you hash the original uncompressed file and then hash the content behind the compression (i.e. you decompress and then hash), then no: a different hash would indicate broken compression or decompression. – Lasse V. Karlsen Nov 10 '16 at 14:47
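The distinction in this comment can be checked with a minimal Python sketch. It assumes gzip as the compression scheme and SHA-256 as the hash, neither of which is specified in the question, and it uses a small in-memory payload as a stand-in for the large file.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the real file contents (an assumption for illustration).
original = b"example payload " * 1000

compressed = gzip.compress(original)    # bytes of the .gz container
restored = gzip.decompress(compressed)  # what you get back after unpacking

hash_original = sha256_hex(original)
hash_compressed = sha256_hex(compressed)
hash_restored = sha256_hex(restored)

# The compressed bytes are different data, so their hash differs.
print("original == compressed:", hash_original == hash_compressed)
# A lossless round trip reproduces the original bytes, so the hash matches.
print("original == restored:  ", hash_original == hash_restored)
```

For a file that is genuinely large, the same comparison would normally be done by reading and hashing in chunks instead of loading everything into memory.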
2 Answers
It depends on which hash you are using. If you are using CRC32, it's pretty trivial to make your hashes the same. MD5 might be possible already (I don't know the state of the art there), SHA1 will probably be doable in a few years. If you are using SHA256, better give up.
Snark about broken crypto aside, unless your hash algorithm knows specifically about your compression setup or your input file was very carefully crafted to provoke a hash collision, the hash of the compressed file will differ from the hash of the original. That means any standard cryptographic hash will change upon compression.
All the hash algorithm sees is a stream of bits without any meaning. It does not know about compression schemes, and should not.

I tried hashing them before and after, and the hash values turn out to be different. I was therefore wondering whether this was only due to the file's metadata being different (the zip has a different name and creation date). – michal111 Nov 10 '16 at 12:59
Which platform are you on? If you are in a Linux/Mac/Unix setup, it's easiest to explain: – Matthias Nov 10 '16 at 13:01
In `$ cat myFile | md5sum` vs `$ cat myFile | compressionProgram | md5sum` the md5sum program doesn't see any metadata or file names at all. – Matthias Nov 10 '16 at 13:02
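The point about metadata can also be seen in a small Python sketch. The file names and the timestamp below are made up for illustration: two files with identical content but different names and modification times produce the same MD5, because only the content bytes are fed to the hash.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def md5_of_file(path) -> str:
    """Hash only the file's content, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; only the content is identical.
    first = Path(tmp) / "report.txt"
    second = Path(tmp) / "renamed_copy.bin"
    first.write_bytes(b"same content\n")
    second.write_bytes(b"same content\n")

    # Give the second file a very different access/modification time.
    os.utime(second, (0, 0))

    same_hash = md5_of_file(first) == md5_of_file(second)

print("hashes match despite different name and mtime:", same_hash)
```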
If your hash is a CRC-32, then you can insert or append four bytes to the compressed data, and set those to get the original CRC. For example, in a gzip stream you can insert a four-byte extra block in the header.
The whole point of cryptographic hashes, such as MD5 (which is a tag on the question), is to make that exceedingly difficult, or practically impossible.
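As a rough illustration of the CRC-32 case, the sketch below computes four bytes that, when appended to arbitrary data, force `zlib.crc32` to a chosen value. The function name `crc32_patch` and the sample byte strings are hypothetical. The sketch only shows the arithmetic on a plain byte string; it does not build a valid gzip container, and bytes placed in the middle of a stream (such as a header field) would need a slightly different calculation because of the data that follows them.

```python
import zlib


def _make_table():
    """Standard reflected CRC-32 lookup table (polynomial 0xEDB88320)."""
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table


TABLE = _make_table()
# The top byte of each table entry is unique, so it identifies the index.
REVERSE = {value >> 24: index for index, value in enumerate(TABLE)}


def crc32_patch(data: bytes, target_crc: int) -> bytes:
    """Return 4 bytes such that zlib.crc32(data + patch) == target_crc."""
    # Internal register values are the published CRC XORed with 0xFFFFFFFF.
    state = zlib.crc32(data) ^ 0xFFFFFFFF
    wanted = target_crc ^ 0xFFFFFFFF
    # Run the CRC backwards over four zero bytes from the wanted end state.
    for _ in range(4):
        idx = REVERSE[wanted >> 24]
        wanted = (((wanted ^ TABLE[idx]) << 8) | idx) & 0xFFFFFFFF
    # Feeding bytes B from `state` equals feeding zeros from `state ^ B`.
    return (state ^ wanted).to_bytes(4, "little")


# Stand-ins for the original file and its compressed form (assumptions).
original = b"the original, uncompressed content"
other = b"stand-in for the compressed bytes"

patch = crc32_patch(other, zlib.crc32(original))
print("CRCs match:", zlib.crc32(other + patch) == zlib.crc32(original))
```

No comparable shortcut is known for SHA-256, which is the contrast this answer draws.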
