8

I am writing some image processing code in which I download images (as `BufferedImage`) from URLs and pass them on to an image processor.

I want to avoid passing the same image to the image processor more than once (the image processing operation is very costly). The URL endpoints of the images may differ even when the images themselves are identical, so I cannot prevent this by comparing URLs. I was therefore planning to compute a checksum or hash to detect when the code encounters the same image again.

For MD5 I tried Fast MD5, and it generated a 20K+ character hex checksum value for a sample image. Obviously, storing a 20K+ character hash would be an issue for database storage. Hence I tried CRC32 (from `java.util.zip.CRC32`), and it did generate a much shorter checksum than the hash.
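For reference, a minimal sketch of how I could compute both digests over the raw image bytes (not my exact code; the PNG re-encoding via `ImageIO` is an assumption on my part):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;
import javax.imageio.ImageIO;

public class ImageDigests {

    // Serialize the image to bytes so both digests run over the same data.
    // PNG is an assumption here; any deterministic encoding would do.
    static byte[] toBytes(BufferedImage img) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);
        return out.toByteArray();
    }

    // MD5 digest: 16 bytes, i.e. a 32-character hex string.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // CRC32 checksum: 32 bits, i.e. at most 8 hex characters.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }
}
```

Computed this way, the MD5 digest is 16 bytes (32 hex characters) and the CRC32 value fits in 8 hex characters.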

I do understand that checksums and hashes serve different purposes. For the purpose explained above, can I just use CRC32? Would it do the job, or do I have to try something beyond these two?

Thanks, Abi

Abhishek
    [Checksum and hash sum are the same](http://en.wikipedia.org/wiki/Checksum). Actually you just look at different algorithms. – Andreas Dolk Jun 17 '11 at 06:33
  • 1
    128bit MD5 hash should be enough for your purpose. – Thor Jun 17 '11 at 06:35
  • 5
    BTW - MD5 should create a 128 bit hash value while a crc32 has 32 bits... What have you done to generate 20k+ length hex checksums? – Andreas Dolk Jun 17 '11 at 06:35
  • I believe this would depend on how many images are going to be processed. Collision probability will increase as the number of images increase. After all there is only so much one can do with 32-bits (CRC32) or 128-bits (MD5). – Vineet Reynolds Jun 17 '11 at 06:36
  • 2
    Maybe compare in phases; first, check the dimensions and file size. If there's no match, pass it along to the processor. Second, take a hash of the first row or two of pixels, or the first 1K, etc (which would be stored in the DB; much smaller since it's only a subset of the image). If those two tests are equal, then and only then, take a hash of the original and new file. This should eliminate a large part of the set before actually hashing the entire image. – AC2MO Jun 17 '11 at 06:37
  • @Andreas_D - that was the same thing I was wondering as well :-( I should check the code again... – Abhishek Jun 17 '11 at 06:52
  • @Gregory Hoerner - Yes will use the idea of two pass checking thanks a bunch – Abhishek Jun 17 '11 at 06:52
  • Abhishek, can you share some link or piece of code that can help me to inspire who to achieve comparing two different images using java.util.zip.CRC32. Many thanks . – javatar Oct 01 '12 at 06:24
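A minimal sketch of the phased pre-check suggested by AC2MO above, assuming the cheap fingerprint stored in the DB is just the dimensions plus a CRC32 of the first pixel row (the class and field names are illustrative):

```java
import java.awt.image.BufferedImage;
import java.util.zip.CRC32;

// Illustrative value object: a cheap fingerprint computed before any full hash.
class ImagePreCheck {
    final int width;
    final int height;
    final long firstRowCrc;   // CRC32 of the first row of pixels only

    ImagePreCheck(BufferedImage img) {
        this.width = img.getWidth();
        this.height = img.getHeight();

        // Grab the first row of pixels as packed ARGB ints...
        int[] firstRow = img.getRGB(0, 0, img.getWidth(), 1, null, 0, img.getWidth());

        // ...and checksum their bytes.
        CRC32 crc = new CRC32();
        for (int pixel : firstRow) {
            crc.update(pixel >>> 24);
            crc.update((pixel >>> 16) & 0xFF);
            crc.update((pixel >>> 8) & 0xFF);
            crc.update(pixel & 0xFF);
        }
        this.firstRowCrc = crc.getValue();
    }

    // Only when this cheap fingerprint matches is a full hash worth computing.
    boolean mightEqual(ImagePreCheck other) {
        return width == other.width
                && height == other.height
                && firstRowCrc == other.firstRowCrc;
    }
}
```

Only when `mightEqual` reports a match does a full-image hash, as discussed in the answers below, need to be computed.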

2 Answers

5

The difference between CRC and, say, MD5, is that it is more difficult to tamper with a file so that it matches a "target" MD5 than to tamper with it so that it matches a "target" checksum. Since this does not seem to be a problem for your program, it should not matter which method you use. MD5 may be a little more CPU intensive, but I do not know whether that difference will matter.

The main question should be the number of bytes of the digest.

Doing the checksum in a 32-bit integer means that, for a file of 2 KB (2^16384 possible contents), you are fitting all of those onto only 2^32 checksum values --> for every CRC value there are astronomically many (about 2^16352) possible files that match it. With a 128-bit MD5 there are still about 2^16256 possible files per hash value, but the chance of an accidental collision between files you actually process drops from roughly 1 in 2^32 to roughly 1 in 2^128.
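As a rough sanity check on those odds (using the standard birthday-bound approximation; the image counts below are illustrative), the probability of at least one accidental collision among $n$ distinct images with a $b$-bit digest is approximately

$$P_{\text{collision}} \approx \frac{n(n-1)}{2^{\,b+1}}$$

For CRC32 ($b = 32$) and $n = 10{,}000$ images this is already about 1%, while for MD5 ($b = 128$) it stays around $10^{-27}$ even at a million images.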

The bigger the code that you compute, the fewer possible collisions (given that the computed codes are distributed evenly), so the safer the comparison.

Anyway, in order to minimize possible errors, I think the first classification should be by file size: first compare file sizes, and only if they match compare checksums/hashes.
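A minimal sketch of that ordering, assuming already-processed images are tracked in memory by byte size and CRC32 value (the class and method names are illustrative only):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.zip.CRC32;

// Illustrative registry of already-processed images, keyed by byte size so the
// cheap size comparison happens before any checksum is consulted.
class ProcessedImages {
    private final Map<Integer, Set<Long>> checksumsBySize = new HashMap<>();

    /** True if an image of the same size and checksum has been seen already. */
    boolean alreadyProcessed(byte[] imageBytes) {
        Set<Long> sameSize = checksumsBySize.get(imageBytes.length);
        if (sameSize == null) {
            return false;          // no image of this size yet: cannot be a duplicate
        }
        return sameSize.contains(checksum(imageBytes));
    }

    void markProcessed(byte[] imageBytes) {
        checksumsBySize
                .computeIfAbsent(imageBytes.length, k -> new HashSet<>())
                .add(checksum(imageBytes));
    }

    private static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }
}
```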

SJuan76
2

A checksum and a hash are basically the same. You should be able to calculate any kind of hash; a regular MD5 would normally suffice. If you like, you could store the size and the MD5 hash (which is 16 bytes, I think).

If two files have different sizes, they are different files. You will not even need to calculate a hash over the data. If it is unlikely that you have many duplicate files, and the files are of the larger kind (like JPG pictures taken with a camera), this optimization may spare you a lot of time.

If two or more files have the same size, you can calculate the hashes and compare them.

If two hashes are the same, you could compare the actual data to see whether it is different after all. This is very, very unlikely, but theoretically possible. The larger your hash (MD5 is 16 bytes, while CRC32 is only 4), the less likely it is that two different files will have the same hash. It will take only 10 minutes of programming to perform this extra check though, so I'd say: better safe than sorry. :)

To further optimize this, if exactly two files have the same size, you can just compare their data. You will need to read the files anyway to calculate their hashes, so why not compare them directly if they are the only two with that specific size.
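A sketch of that full sequence, sizes first, then hashes, then a byte-wise check as the final tie-breaker; the method name is illustrative, while the `MessageDigest` and `Arrays.equals` calls are from the standard JDK:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class DuplicateCheck {

    /** True only if the two byte arrays are provably the same image data. */
    static boolean sameImage(byte[] a, byte[] b) throws NoSuchAlgorithmException {
        // 1. Different sizes: different files, no hashing needed.
        if (a.length != b.length) {
            return false;
        }
        // 2. Same size: compare 128-bit MD5 digests (16 bytes each).
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] hashA = md5.digest(a);   // digest() resets the instance, so it can be reused
        byte[] hashB = md5.digest(b);
        if (!MessageDigest.isEqual(hashA, hashB)) {
            return false;
        }
        // 3. Hashes match: a collision is extremely unlikely but cheap to rule out,
        //    so compare the raw bytes as the "better safe than sorry" step.
        return Arrays.equals(a, b);
    }
}
```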

Duncan McGregor
GolezTrol
  • 1
    Maybe there could be an issue with storing the files already processed to compare them with the new ones. A checksum or hash takes way less space. – SJuan76 Jun 17 '11 at 06:55
  • 1
    That's true. I never meant to store the entire file in the database for comparison. Just saying that for a single run, you would't need to calculate a hash at all. If you do store the data to check newly added files, then it makes sense to store a hash, or you could choose to store the file size only, and calculate (and store) the hash only if two file sizes match. That will save space, and save disk IO. – GolezTrol Jun 17 '11 at 07:48