
At the moment I'm using a simple checksum scheme that just adds up the words in a buffer. Firstly, what is the probability of a false negative, that is, of the receiving system computing the same checksum as the sending system even though the data is different (corrupted)?

Secondly, how can I reduce the probability of false negatives? What is the best checksumming scheme for that? Note that each word in the buffer is 64 bits (8 bytes) wide, i.e. a long variable on a 64-bit system.
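Roughly, the scheme looks like this (a simplified sketch, not the exact production code; the function name is just for illustration):

    #include <stdint.h>
    #include <stddef.h>

    /* Simple additive checksum: sum all 64-bit words,
       wrapping around modulo 2^64. */
    uint64_t additive_checksum(const uint64_t *buf, size_t nwords)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < nwords; i++)
            sum += buf[i];
        return sum;
    }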

pythonic
    Is there a reason you are not using an industry-standard checksum like CRC, MD5 or SHA? – Jon Mar 01 '12 at 10:44
  • Do I understand correctly that the order of words doesn't matter? Note: if you assign a unique id to each input (store it in a database), then you can reduce the probability to zero... question is, do you need that? – Karoly Horvath Mar 01 '12 at 10:45
  • Assigning a probability of non-detection of changes is possible only if you give the probability of the various changes. Different checksums will behave differently for different change patterns. – AProgrammer Mar 01 '12 at 12:41

2 Answers


Assuming a sane checksum implementation, then the probability of a randomly-chosen input string colliding with a reference input string is 1 in 2^n, where n is the checksum length in bits.

However, if you're talking about input that differs from the original in only a small number of bits, then the probability of collision is generally much, much lower.
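To see why the "sane implementation" caveat matters for the additive scheme in the question: errors that cancel under addition (e.g. two swapped words, or offsetting changes in two words) collide with certainty, not with probability 1/2^n. A minimal sketch, using a hypothetical word-sum checksum like the one described:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    static uint64_t additive_checksum(const uint64_t *buf, size_t nwords)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < nwords; i++)
            sum += buf[i];
        return sum;
    }

    int main(void)
    {
        uint64_t a[4] = {1, 2, 3, 4};
        uint64_t b[4] = {1, 3, 2, 4};   /* words 1 and 2 swapped: corrupted */

        /* Both buffers sum to 10, so the checksums collide with certainty,
           far worse than the 1-in-2^64 a sane 64-bit checksum would give. */
        printf("%d\n", additive_checksum(a, 4) == additive_checksum(b, 4));
        return 0;
    }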

Oliver Charlesworth
  • This is only true for two entries; the probability of collision depends on the number of entries... also, see the birthday paradox – Karoly Horvath Mar 01 '12 at 10:49
  • @yi_H: I'm talking about the probability of a single input string causing a collision with a reference input. Let me clarify my answer. – Oliver Charlesworth Mar 01 '12 at 10:49
  • @user1018562: By "sane", I mean "written by someone competent"! I'm really implying that it would be possible to create a checksum algorithm that doesn't exhibit these properties. – Oliver Charlesworth Mar 01 '12 at 10:56
  • @user1018562: for the sake of argument, use SHA-256 until you come up with a reason not to. Other sane checksums / message digests exist. – Steve Jessop Mar 01 '12 at 10:58
  • @Oli: in practice there probably are checksums that are generally regarded as "sane", but aren't quite uniformly distributed. But I agree with you to a first approximation :-) And obviously, if the probability of collision for similar strings is much lower than 1/2^n (for an error-correcting checksum and a suitable definition of "similar", it's 0), then the probability of collision for random strings must be at least slightly higher: it all has to add up to 1. – Steve Jessop Mar 01 '12 at 11:01
  • @SteveJessop: The set of similar strings is a subset of the set of random strings! The average probability is 1/2^n (assuming a linear checksum). – Oliver Charlesworth Mar 01 '12 at 11:06
  • @Oli: rats, I'm confused now. – Steve Jessop Mar 01 '12 at 11:07

One possibility is to have a look at T. Maxino's thesis, "The Effectiveness of Checksums for Embedded Networks" (PDF), which contains an analysis of several well-known checksums.

However, it is usually better to go with a CRC, which has additional benefits, such as detection of burst errors.

For these, P. Koopman's paper "Cyclic Redundancy Code (CRC) Selection for Embedded Networks" (PDF) is a valuable resource.
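As a point of reference, a plain bitwise CRC-32 over a byte buffer looks roughly like this (a generic textbook sketch using the reflected 0xEDB88320 polynomial, as in Ethernet/zlib; this is not code from either paper):

    #include <stdint.h>
    #include <stddef.h>

    /* Bitwise CRC-32, reflected polynomial 0xEDB88320.
       Slow but simple; table-driven variants are much faster. */
    uint32_t crc32(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        }
        return ~crc;
    }

Koopman's paper discusses how the choice of polynomial affects the guaranteed Hamming distance at different message lengths, so for a specific word size and message length it is worth picking the polynomial accordingly.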

Schedler