... we get the same checksum... Is this expected?
Yes, it is possible. The checksum is 16-bit (and 511 combinations never occur: 0x..FF, 0xFF..), so 3-character strings (24 bits) will certainly have collisions. Pigeonhole principle.
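To put numbers on the pigeonhole argument: each 8-bit half of the result is a value reduced mod 255, so 0xFF never appears in either byte.
distinct 3-character inputs : 256 * 256 * 256 = 16,777,216
reachable 16-bit checksums  : 255 * 255       =     65,025   (65,536 - 511)
With over 16 million inputs mapping onto about 65 thousand outputs, collisions are unavoidable.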
Should not the Fletcher16 checksum account for the order of the blocks?
It does. It is just that the algorithm readily collides for select similar inputs. Also see Hamming distance.
BTW, the original algorithm gives different results depending on whether the length (3) or the size (4, which also takes in the null character) of the string is used. With the size, the 4 input strings give a different pair of results, yet the same strings still collide:
printf("%x\n", fletcher16("BCA",3)); // 8ec6
printf("%x\n", fletcher16("CAB",3)); // 8ec6 same
printf("%x\n", fletcher16("BAC",3)); // 8cc6
printf("%x\n", fletcher16("ACB",3)); // 8cc6 same
printf("%x\n", fletcher16("BCA",4)); // 55c6
printf("%x\n", fletcher16("CAB",4)); // 55c6 same
printf("%x\n", fletcher16("BAC",4)); // 53c6
printf("%x\n", fletcher16("ACB",4)); // 53c6 same
OP's suggested improvement also weakens the checksum: or-ing in the index disregards select bits at each stage (e.g. ASCII 'C' is 0x43, which already has bit 1 set, so 'C' | 2 is still 'C'). Suggest xor-ing or adding instead.
Minor nits:
// return (sum2 << 8) | sum1;
return (1u*sum2 << 8) | sum1;
This change is not detrimental for any int/unsigned size, yet it avoids implementation-defined behavior when int/unsigned are 16-bit. Best to ensure code does not left-shift into the sign bit.
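Concretely, with 16-bit int (INT_MAX == 32767), sum2 can be as large as 254, and 254 << 8 is 65024, which does not fit in int; multiplying by 1u first makes the whole expression unsigned, where 65024 is representable.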
some_int % 255 performs a signed remainder. On some devices, simple embedded ones for example, an unsigned remainder is as fast or faster. Nothing is lost with % 255u, and there are potential gains.
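Applied to the two update lines, that change looks like the below (a sketch; it assumes the data is accessed as unsigned bytes, e.g. const uint8_t *data, so the left side of % is never negative):
sum1 = (sum1 + data[index]) % 255u;  // unsigned remainder
sum2 = (sum2 + sum1) % 255u;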
[Edit]
Although OP has "fixed" the code for short strings, it defeats a design parameter of fletcher16(): speed of execution.
Details: setting aside the % 255 for a moment, sum1 is data[0] + ... + data[count-1] and sum2 is data[0]*(count) + data[1]*(count-1) + ... + data[count-1]*(1). It becomes easy to create strings 1, 2, 3, etc. characters long with low values for which the % 255 rarely, if ever, has any effect.
Notice that sum2 is the part that effectively creates different checksums based on order. Should the sum of the array elements never reach 255 (this occurs in OP's 4 test cases), sum1 will be the same for any 2 strings that differ only in order.
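A quick check with the ASCII values ('A' = 65, 'B' = 66, 'C' = 67) shows why "BCA" and "CAB" collide in both halves:
sum1        : 66 + 67 + 65          = 198             = 0xC6  (same for any ordering)
sum2 ("BCA"): 3*66 + 2*67 + 1*65    = 397, % 255 = 142 = 0x8E
sum2 ("CAB"): 3*67 + 2*65 + 1*66    = 397, % 255 = 142 = 0x8E
Both come out as 0x8EC6, matching the first two printf() results.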
To effectively "mix up/hash" short strings with low values, a different algorithm is needed.
Perhaps only use a variant when count < 8:
sum1 = (sum1 + index + data[index]) % 255;
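Below is a sketch of how such a variant could be wired in (the function name, parameter types, and the count < 8 threshold are illustrative, not OP's code). Whether the extra mixing is enough for a given data set should still be checked against collision cases like the ones above.

#include <stddef.h>
#include <stdint.h>

// Illustrative variant: for short inputs, also fold the element position into sum1.
uint16_t fletcher16_short_aware(const uint8_t *data, size_t count) {
    uint16_t sum1 = 0, sum2 = 0;
    for (size_t index = 0; index < count; ++index) {
        if (count < 8) {
            sum1 = (sum1 + index + data[index]) % 255u;  // short-string mixing
        } else {
            sum1 = (sum1 + data[index]) % 255u;          // classic Fletcher-16
        }
        sum2 = (sum2 + sum1) % 255u;
    }
    return (uint16_t)((1u * sum2 << 8) | sum1);
}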