39

Which checksum algorithm can you recommend in the following use case?

I want to generate checksums of small JPEG files (~8 kB each) to check if the content changed. Using the filesystem's date modified is unfortunately not an option.
The checksum need not be cryptographically strong but it should robustly indicate changes of any size.

The second criterion is speed since it should be possible to process at least hundreds of images per second (on a modern CPU).

The calculation will be done on a server with several clients. The clients send the images over Gigabit TCP to the server. So there's no disk I/O as bottleneck.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Benedikt Waldvogel
  • 12,406
  • 8
  • 49
  • 61
  • 1
    1 Gigabit is 125 Megabytes (full duplex). Of those 125MB/s, a very significant proportion will be networking overhead (particularly when you want to send hundreds of something small). Since MD5 on a small core can run at nearly 250MB/s, you should change it from **no disk I/O as bottleneck** to something like **entirely disk I/O as bottleneck**. This question has been up for 7 years and no one seems to have mentioned that if you had only tried something before posting to StackOverflow, you'd have seen this for yourself. – J.J Mar 01 '16 at 14:42
  • As I understand it, the server could precalculate the checksum of all its files. Thus it wouldn't need to do any disk I/O at all? It only needs to read the files sent by clients and checksum those, and compare to its precalculated checksums? – avl_sweden Aug 07 '20 at 07:05

11 Answers11

23

If you have many small files, your bottleneck is going to be file I/O and probably not a checksum algorithm.

A list of hash functions (which can be thought of as a checksum) can be found here.

Is there any reason you can't use the filesystem's date modified to determine if a file has changed? That would probably be faster.

luke
  • 36,103
  • 8
  • 58
  • 81
  • 2
    Agreed. Most non-cryptographically-secure hash algorithms are short enough that there's an appreciable chance of false negatives, so I would simply use MD5 or SHA1 and be done with it. Computation speed is unlikely to be an issue. – Nick Johnson Sep 24 '08 at 07:44
  • 4
    @luke I also want to check file modification, and last time I did review my archive I noticed that a few percent of the files in my archive (a small 100GB archive on two different disks) is broken! The timestamp was not changed, but the file contents was silently corrupted. Some of the files are more than 10 years old, so I must have moved them at least twice from one machine to another. So checksum is a must. – bartosz.r Feb 06 '12 at 15:44
16

There are lots of fast CRC algorithms that should do the trick: http://www.google.com/search?hl=en&q=fast+crc&aq=f&oq=

Edit: Why the hate? CRC is totally appropriate, as evidenced by the other answers. A Google search was also appropriate, since no language was specified. This is an old, old problem which has been solved so many times that there isn't likely to be a definitive answer.

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
  • With enough files, a collision becomes very probable. Look up 'birthday paradox'. For CRC32 you only need about 100 000 images for a collision to be highly probable. – avl_sweden Aug 07 '20 at 07:09
  • 4
    @avl_sweden not sure why you directed that comment at me. From the comments by the question author, the intent is to detect file corruption not deduplication, so duplicate CRCs won't matter. That's exactly the problem CRC was invented to solve. – Mark Ransom Aug 07 '20 at 14:34
  • 1
    Good point! Please accept my apologies for my misunderstanding! – avl_sweden Aug 08 '20 at 19:59
10
  • CRC-32 comes into mind mainly because it's cheap to calculate

  • Any kind of I/O comes into mind mainly because this will be the limiting factor for such an undertaking ;)

  • The problem is not calculating the checksums, the problem is to get the images into memory to calculate the checksum.

  • I would suggest "stagged" monitoring:

    • stage 1: check for changes of file timestamps and if you detect a change there hand over to...
      (not needed in your case as described in the edited version)

    • stage 2: get the image into memory and calculate the checksum

  • For sure important as well: multi-threading: setting up a pipeline which enables processing of several images in parallel if several CPU cores are available.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
pointernil
  • 608
  • 8
  • 17
9

If you are receiving the files over network you can calculate the checksum as you receive the file. This will ensure that you will calculate the checksum while the data is in memory. Hence you won't have to load them into memory from disk.

I believe if you apply this method, you'll see almost-zero overhead on your system.

This is the routines I'm using on an embedded system which does checksum control on firmware and other stuff.

static const uint32_t crctab[] = {
    0x0,
    0x04c11db7, 0x09823b6e, 0x0d4326d9, 0x130476dc, 0x17c56b6b,
    0x1a864db2, 0x1e475005, 0x2608edb8, 0x22c9f00f, 0x2f8ad6d6,
    0x2b4bcb61, 0x350c9b64, 0x31cd86d3, 0x3c8ea00a, 0x384fbdbd,
    0x4c11db70, 0x48d0c6c7, 0x4593e01e, 0x4152fda9, 0x5f15adac,
    0x5bd4b01b, 0x569796c2, 0x52568b75, 0x6a1936c8, 0x6ed82b7f,
    0x639b0da6, 0x675a1011, 0x791d4014, 0x7ddc5da3, 0x709f7b7a,
    0x745e66cd, 0x9823b6e0, 0x9ce2ab57, 0x91a18d8e, 0x95609039,
    0x8b27c03c, 0x8fe6dd8b, 0x82a5fb52, 0x8664e6e5, 0xbe2b5b58,
    0xbaea46ef, 0xb7a96036, 0xb3687d81, 0xad2f2d84, 0xa9ee3033,
    0xa4ad16ea, 0xa06c0b5d, 0xd4326d90, 0xd0f37027, 0xddb056fe,
    0xd9714b49, 0xc7361b4c, 0xc3f706fb, 0xceb42022, 0xca753d95,
    0xf23a8028, 0xf6fb9d9f, 0xfbb8bb46, 0xff79a6f1, 0xe13ef6f4,
    0xe5ffeb43, 0xe8bccd9a, 0xec7dd02d, 0x34867077, 0x30476dc0,
    0x3d044b19, 0x39c556ae, 0x278206ab, 0x23431b1c, 0x2e003dc5,
    0x2ac12072, 0x128e9dcf, 0x164f8078, 0x1b0ca6a1, 0x1fcdbb16,
    0x018aeb13, 0x054bf6a4, 0x0808d07d, 0x0cc9cdca, 0x7897ab07,
    0x7c56b6b0, 0x71159069, 0x75d48dde, 0x6b93dddb, 0x6f52c06c,
    0x6211e6b5, 0x66d0fb02, 0x5e9f46bf, 0x5a5e5b08, 0x571d7dd1,
    0x53dc6066, 0x4d9b3063, 0x495a2dd4, 0x44190b0d, 0x40d816ba,
    0xaca5c697, 0xa864db20, 0xa527fdf9, 0xa1e6e04e, 0xbfa1b04b,
    0xbb60adfc, 0xb6238b25, 0xb2e29692, 0x8aad2b2f, 0x8e6c3698,
    0x832f1041, 0x87ee0df6, 0x99a95df3, 0x9d684044, 0x902b669d,
    0x94ea7b2a, 0xe0b41de7, 0xe4750050, 0xe9362689, 0xedf73b3e,
    0xf3b06b3b, 0xf771768c, 0xfa325055, 0xfef34de2, 0xc6bcf05f,
    0xc27dede8, 0xcf3ecb31, 0xcbffd686, 0xd5b88683, 0xd1799b34,
    0xdc3abded, 0xd8fba05a, 0x690ce0ee, 0x6dcdfd59, 0x608edb80,
    0x644fc637, 0x7a089632, 0x7ec98b85, 0x738aad5c, 0x774bb0eb,
    0x4f040d56, 0x4bc510e1, 0x46863638, 0x42472b8f, 0x5c007b8a,
    0x58c1663d, 0x558240e4, 0x51435d53, 0x251d3b9e, 0x21dc2629,
    0x2c9f00f0, 0x285e1d47, 0x36194d42, 0x32d850f5, 0x3f9b762c,
    0x3b5a6b9b, 0x0315d626, 0x07d4cb91, 0x0a97ed48, 0x0e56f0ff,
    0x1011a0fa, 0x14d0bd4d, 0x19939b94, 0x1d528623, 0xf12f560e,
    0xf5ee4bb9, 0xf8ad6d60, 0xfc6c70d7, 0xe22b20d2, 0xe6ea3d65,
    0xeba91bbc, 0xef68060b, 0xd727bbb6, 0xd3e6a601, 0xdea580d8,
    0xda649d6f, 0xc423cd6a, 0xc0e2d0dd, 0xcda1f604, 0xc960ebb3,
    0xbd3e8d7e, 0xb9ff90c9, 0xb4bcb610, 0xb07daba7, 0xae3afba2,
    0xaafbe615, 0xa7b8c0cc, 0xa379dd7b, 0x9b3660c6, 0x9ff77d71,
    0x92b45ba8, 0x9675461f, 0x8832161a, 0x8cf30bad, 0x81b02d74,
    0x857130c3, 0x5d8a9099, 0x594b8d2e, 0x5408abf7, 0x50c9b640,
    0x4e8ee645, 0x4a4ffbf2, 0x470cdd2b, 0x43cdc09c, 0x7b827d21,
    0x7f436096, 0x7200464f, 0x76c15bf8, 0x68860bfd, 0x6c47164a,
    0x61043093, 0x65c52d24, 0x119b4be9, 0x155a565e, 0x18197087,
    0x1cd86d30, 0x029f3d35, 0x065e2082, 0x0b1d065b, 0x0fdc1bec,
    0x3793a651, 0x3352bbe6, 0x3e119d3f, 0x3ad08088, 0x2497d08d,
    0x2056cd3a, 0x2d15ebe3, 0x29d4f654, 0xc5a92679, 0xc1683bce,
    0xcc2b1d17, 0xc8ea00a0, 0xd6ad50a5, 0xd26c4d12, 0xdf2f6bcb,
    0xdbee767c, 0xe3a1cbc1, 0xe760d676, 0xea23f0af, 0xeee2ed18,
    0xf0a5bd1d, 0xf464a0aa, 0xf9278673, 0xfde69bc4, 0x89b8fd09,
    0x8d79e0be, 0x803ac667, 0x84fbdbd0, 0x9abc8bd5, 0x9e7d9662,
    0x933eb0bb, 0x97ffad0c, 0xafb010b1, 0xab710d06, 0xa6322bdf,
    0xa2f33668, 0xbcb4666d, 0xb8757bda, 0xb5365d03, 0xb1f740b4
};

typedef struct crc32ctx
{
    uint32_t crc;
    uint32_t length;
} CRC32Ctx;


#define COMPUTE(var, ch)    (var) = (var) << 8 ^ crctab[(var) >> 24 ^ (ch)]

void crc32_stream_init( CRC32Ctx* ctx )
{
    ctx->crc = 0;
    ctx->length = 0;
}

void crc32_stream_compute_uint32( CRC32Ctx* ctx, uint32_t data )
{
    COMPUTE( ctx->crc, data & 0xFF );
    COMPUTE( ctx->crc, ( data >> 8 ) & 0xFF );
    COMPUTE( ctx->crc, ( data >> 16 ) & 0xFF );
    COMPUTE( ctx->crc, ( data >> 24 ) & 0xFF );
    ctx->length += 4;
}

void crc32_stream_compute_uint8( CRC32Ctx* ctx, uint8_t data )
{
    COMPUTE( ctx->crc, data );
    ctx->length++;
}

void crc32_stream_finilize( CRC32Ctx* ctx )
{
    uint32_t len = ctx->length;
    for( ; len != 0; len >>= 8 )
    {
        COMPUTE( ctx->crc, len & 0xFF );
    }
    ctx->crc = ~ctx->crc;
}

/*** pseudo code ***/
CRC32Ctx crc;
crc32_stream_init(&crc);

while((just_received_buffer_len = received_anything()))
{
    for(int i = 0; i < just_received_buffer_len; i++)
    {
        crc32_stream_compute_uint8(&crc, buf[i]); // assuming buf is uint8_t*
    }
}
crc32_stream_finilize(&crc);
printf("%x", crc.crc); // ta daaa
Engin
  • 166
  • 1
  • 4
6

CRC

radu_c
  • 147
  • 1
  • 4
5

adler32, available in the zlib headers, is advertised as being significantly faster than crc32, while being only slightly less accurate.

Josh Matthews
  • 12,816
  • 7
  • 36
  • 39
  • 7
    The Adler checksum is very weak for small amounts of data, less than a few hundred bytes. The SCTP protocol originally used Adler-32 as its checksum but had to switch to CRC-32 for this reason (in RFC 3309). Though the questioners average size is 8K, for the smallest files it could be a problem. – DGentry Sep 23 '08 at 19:23
3

CRC32 is probably good enough, although there's a small chance you might get a collision, such that a file that has been modified might look like it hasn't been because the two versions generate the same checksum. To avoid this possibility I'd therefore suggest using MD5, which will easily be fast enough, and the chances of a collision occurring is reduced to the point where it's almost infinitessimal.

As others have said, with lots of small files your real performance bottleneck is going to be I/O so the issue is dealing with that. If you post up a few more details somebody will probably suggest a way of sorting that out as well.

Bart Read
  • 2,717
  • 3
  • 22
  • 32
3

Your most important requirement is "to check if the content changed".

If it most important that ANY change in the file be detected, MD-5, SHA-1 or even SHA-256 should be your choice.

Given that you indicated that the checksum NOT be cryptographically good, I would recommend CRC-32 for three reasons. CRC-32 gives good hamming distances over an 8K file. CRC-32 will be at least an order of magnitude faster than MD-5 to calculate (your second requirement). Sometimes as important, CRC-32 only requires 32 bits to store the value to be compared. MD-5 requires 4 times the storage and SHA-1 requires 5 times the storage.

BTW, any technique will be strengthened by prepending the length of the file when calculating the hash.

Epsilon
  • 784
  • 5
  • 5
2

According to the Wiki page pointed to by Luke, MD5 is actually faster than CRC32!

I have tried this myself by using Python 2.6 on Windows Vista, and got the same result.

Here are some results:

crc32: 162.481544276 MBps md5: 224.489791549 MBps

crc32: 168.332996575 MBps md5: 226.089336532 MBps

crc32: 155.851515828 MBps md5: 194.943289532 MBps

I am thinking about the same question as well, and I'm tempted to use the Rsync's variation of Adler-32 for detecting file differences.

chnrxn
  • 1,349
  • 1
  • 16
  • 17
  • 2
    That's unusual. Is the quality of the implementation of CRC32 and MD5 equivalent, or does one use every compiler optimization and lookup table trick there is, while the other one doesn't? – Ants Mar 24 '10 at 23:59
1

Just a postscript to the above; jpegs use lossy compression and the extent of the compression may depend upon the program used to create the jpeg, the colour pallette and/or bit-depth on the system, display gamma, graphics card and user-set compression levels/colour settings. Therefore, comparing jpegs built on different computers/platforms or using different software will be very difficult at the byte level.

0

This is 5 times faster than CCITT and makes exactly the same job:

Python:

def crc16_fast(data: bytearray, length):
crc = 0xCACA
for i in range(length):
    crc ^= data[i]
return crc

C:

uint16_t crc16_fast(const uint16_t* data, size_t length)
{
    uint16_t crc = 0xCACA;
    for (size_t i = 0; i < length; i++) 
        crc ^= data[i];
    return crc;
}
mathengineer
  • 140
  • 6