3

I'm working on a project that involves computing hashes for files. The project is like a file backup service: when a file is uploaded from the client to the server, I need to check whether that file is already available on the server. I generate a CRC-32 hash for the file and send the hash to the server to check whether it's already there.

If the file is not on the server, I send it as 512 KB chunks (for dedupe), and I have to calculate a hash for each 512 KB chunk. File sizes can run to a few GB, and multiple clients will connect to the server, so I really need a fast and lightweight hashing algorithm for files. Any ideas?
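For reference, here is a minimal sketch of what I do now (Python; zlib.crc32, the 512 KB chunk size constant, and the function name are just illustrative):

    import zlib

    CHUNK_SIZE = 512 * 1024  # 512 KB chunks, as used for dedupe

    def chunk_crcs(path):
        """Yield (chunk_index, crc32) for each 512 KB chunk of a file."""
        with open(path, "rb") as f:
            index = 0
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                # mask keeps the value an unsigned 32-bit integer
                yield index, zlib.crc32(chunk) & 0xFFFFFFFF
                index += 1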

P.S.: I have already noticed some hashing-algorithm questions on Stack Overflow, but the answers don't quite compare the algorithms for exactly this kind of task. I bet this will be really useful for a bunch of people.

Manikandaraj Srinivasan
  • Sounds like you want to precompute a "hash chain" for your file: hash the *last* chunk, then hash the second-to-last chunk with the last hash appended, and so forth. Distribute only the hash of the first chunk plus the second hash. Since it's precomputed, the time it takes to hash may not be a primary concern. – Kerrek SB Nov 30 '12 at 13:57
  • Those chunks are independent, and I have to combine the chunks into a whole file when receiving the data back on the client. I'm planning on chunks and their hashes for deduplication: if the hash of a chunk is already present, then I don't have to send that chunk from client to server. – Manikandaraj Srinivasan Nov 30 '12 at 14:45
  • Looks like processing 512K chunks of the file with an algorithm that's well adapted to stream processors is a job for a GPU. CRC32 is way too basic. Take a look at MD5 (but avoid using it in a way that can be brute-forced... using a GPU... irony). – ActiveTrayPrntrTagDataStrDrvr Nov 30 '12 at 15:00
  • @ActiveTrayPrntrTagDataStrDrvr I looked at MD5; it seems slow for calculating the hash of a large file, and I hope the MD5 era has passed. – Manikandaraj Srinivasan Nov 30 '12 at 17:59

3 Answers

4

Actually, CRC32 has neither the best speed nor the best distribution.

This is to be expected: CRC32 is pretty old by today's standards and was created in an era when CPUs were neither 32/64 bits wide nor out-of-order-execution capable, and when distribution properties mattered less than error detection. All of these requirements have changed since.

To evaluate the speed and distribution properties of hash algorithms, Austin Appleby created the excellent SMHasher package. A short summary of results is presented here. I would advise selecting an algorithm with a Q.Score of 10 (perfect distribution).
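For illustration (not taken from the summary itself): MurmurHash3, the reference hash of the SMHasher suite, is one algorithm with a perfect Q.Score. A minimal per-chunk sketch using the third-party mmh3 Python binding, assuming it is installed:

    # pip install mmh3  (third-party MurmurHash3 binding; assumed available)
    import mmh3

    def chunk_fingerprint(chunk, seed=0):
        """32-bit MurmurHash3 of one chunk, as an unsigned integer."""
        return mmh3.hash(chunk, seed) & 0xFFFFFFFF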

Cyan
0

You say you are using CRC-32 but want a faster hash. CRC-32 is very basic and pretty fast; I would think the I/O time would be much longer than the hash time. You also want a hash that will not have collisions, that is, where two different files or 512 KB chunks get the same hash value. You could look at any of the cryptographic hashes, like MD5 (do not use it for secure applications) or SHA-1.
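For example, a per-chunk sketch with Python's standard hashlib (the chunk size constant and function name are just illustrative):

    import hashlib

    CHUNK_SIZE = 512 * 1024  # 512 KB

    def chunk_digests(path, algorithm="sha1"):
        """Yield the hex digest of each 512 KB chunk of a file."""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                yield hashlib.new(algorithm, chunk).hexdigest()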

brian beuning
0

If you are only using CRC-32 to check whether a file is a duplicate, you are going to get false duplicates, because different files can have the same CRC-32. You had better use SHA-1; CRC-32 and MD5 are both too weak.
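A minimal sketch of that idea (the in-memory dict and function name are assumptions standing in for whatever index the server really uses):

    import hashlib

    # toy server-side index: SHA-1 digest -> stored chunk; a real server
    # would use a database or key-value store rather than a dict
    chunk_store = {}

    def store_chunk(chunk):
        """Store a chunk under its SHA-1 digest, skipping known chunks."""
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in chunk_store:
            chunk_store[digest] = chunk
        return digest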