... we get the same checksum... Is this expected?
Yes, it is possible. The checksum is 16-bit (and 511 combinations never occur: 0x..FF, 0xFF..), so 3-character strings (24 bits) will certainly have collisions. Pigeonhole principle.
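To put numbers on the pigeonhole argument: each 8-bit half of the result is a value reduced mod 255, so 0xFF never appears in either byte.
distinct 3-character inputs : 256 * 256 * 256 = 16,777,216
reachable 16-bit checksums  : 255 * 255       =     65,025   (65,536 - 511)
With over 16 million inputs mapping onto about 65 thousand outputs, collisions are unavoidable.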
Should not the Fletcher16 checksum account for the order of the blocks?
It does. It is just that the algorithm readily collides for select similar inputs. Also see Hamming distance.
BTW, the original algorithm gives different results depending on whether the length (3) or the size (4, which also takes in the null character) of the string is used. With the size, the 4 input strings give a different pair of results, yet the same strings still collide:
printf("%x\n", fletcher16("BCA",3)); // 8ec6
printf("%x\n", fletcher16("CAB",3)); // 8ec6 same
printf("%x\n", fletcher16("BAC",3)); // 8cc6
printf("%x\n", fletcher16("ACB",3)); // 8cc6 same
printf("%x\n", fletcher16("BCA",4)); // 55c6
printf("%x\n", fletcher16("CAB",4)); // 55c6 same
printf("%x\n", fletcher16("BAC",4)); // 53c6
printf("%x\n", fletcher16("ACB",4)); // 53c6 same
OP's suggested improvement also weakens the checksum: or-ing in the index disregards select bits at each stage (e.g. ASCII 'C' is 0x43, which already has bit 1 set, so 'C' | 2 is still 'C'). Suggest xor-ing or adding instead.
Minor nits:
// return (sum2 << 8) | sum1;
return (1u*sum2 << 8) | sum1;
This change is not detrimental for any int/unsigned size, yet it avoids implementation-defined behavior when int/unsigned are 16-bit. Best to ensure code does not left-shift into the sign bit.
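Concretely, with 16-bit int (INT_MAX == 32767), sum2 can be as large as 254, and 254 << 8 is 65024, which does not fit in int; multiplying by 1u first makes the whole expression unsigned, where 65024 is representable.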
some_int % 255 performs a signed remainder. On some devices, simple embedded ones for example, an unsigned remainder is as fast or faster. Nothing is lost with % 255u, and there are potential gains.
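Applied to the two update lines, that change looks like the below (a sketch; it assumes the data is accessed as unsigned bytes, e.g. const uint8_t *data, so the left side of % is never negative):
sum1 = (sum1 + data[index]) % 255u;  // unsigned remainder
sum2 = (sum2 + sum1) % 255u;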
[Edit]
Although OP has "fixed" the code for short strings, it defeats a design parameter of fletcher16(): speed of execution.
Details: setting aside the % 255 for a moment, sum1 is data[0] + ... + data[count-1] and sum2 is data[0]*(count) + data[1]*(count-1) + ... + data[count-1]*(1). It becomes easy to create strings 1, 2, 3, etc. characters long with low values for which the % 255 rarely, if ever, has any effect.
Notice that sum2 is the part that effectively creates different checksums based on order. Should the sum of the array elements never reach 255 (this occurs in OP's 4 test cases), sum1 will be the same for any 2 strings that differ only in order.
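A quick check with the ASCII values ('A' = 65, 'B' = 66, 'C' = 67) shows why "BCA" and "CAB" collide in both halves:
sum1        : 66 + 67 + 65          = 198             = 0xC6  (same for any ordering)
sum2 ("BCA"): 3*66 + 2*67 + 1*65    = 397, % 255 = 142 = 0x8E
sum2 ("CAB"): 3*67 + 2*65 + 1*66    = 397, % 255 = 142 = 0x8E
Both come out as 0x8EC6, matching the first two printf() results.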
To effectively "mix up/hash" short strings with low values, a different algorithm is needed.
Perhaps only use a variant when count < 8:
sum1 = (sum1 + index + data[index]) % 255;
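Below is a sketch of how such a variant could be wired in (the function name, parameter types, and the count < 8 threshold are illustrative, not OP's code). Whether the extra mixing is enough for a given data set should still be checked against collision cases like the ones above.

#include <stddef.h>
#include <stdint.h>

// Illustrative variant: for short inputs, also fold the element position into sum1.
uint16_t fletcher16_short_aware(const uint8_t *data, size_t count) {
    uint16_t sum1 = 0, sum2 = 0;
    for (size_t index = 0; index < count; ++index) {
        if (count < 8) {
            sum1 = (sum1 + index + data[index]) % 255u;  // short-string mixing
        } else {
            sum1 = (sum1 + data[index]) % 255u;          // classic Fletcher-16
        }
        sum2 = (sum2 + sum1) % 255u;
    }
    return (uint16_t)((1u * sum2 << 8) | sum1);
}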