Understanting an efficient CRC-CCITT-16 implementation

Question

I came across an allegedly very efficient and elegant CRC implementation and I am trying to really understand all the steps. I understand the CRC-CCITT 0x1021 implementations that iterate over each bit, but I am struggling to get this one. This is the code.

/*
* Original Code by Ashley Roll 
*/

uint16_t crc16(uint8_t* data, uint16_t length){

  uint8_t x;
  uint16_t crc = 0xFFFF;

  while (length--){
      x = crc >> 8 ^ *data++;
      x ^= x>>4;
      crc = (crc << 8) ^ ((uint16_t)(x << 12)) ^ ((uint16_t)(x <<5)) ^ ((uint16_t)x);
  }

  return crc;
}

Could anyone explain to me a little bit deeper what is going on with the x variable? This is what I was able to grasp until now

x = crc >> 8 ^ *data++; // Here, we take the high byte of the register
                        // and we XOR it with the next 8 bits of data. Why?
x ^= x>>4;              // Then, we XOR it with its last four bits?
                        // And we always remain with 4 possible non-zero bits, right?
                        // Why do we do this?
crc = (crc << 8) ^ ((uint16_t)(x << 12)) ^ ((uint16_t)(x <<5)) ^ ((uint16_t)x);
                        // Finally, we drop the oldest (highest) byte from the register
                        // And XOR it with rotations of x, according to the polynomial
                        // 0x1021, which is x^16 + x^12 + x^5 + 1
                        // Is (16-12) = 4 the reason why x has 4 possible non-zero bits?

I guess that this algorithm is just the equivalent of 8 loops of a bitwise algorithm but I would appreciate some clarification. Thanks for your time.

For an "efficient" CRC, this is highly questionable. I'd expect an efficient version to use nibble- or byte-wise lookup tables. Also the casts to `uint16_t` are fishy on a 32 bit system. Picking up code smell. — Lundin, Nov 11 '19 at 08:46

rcgldr · Accepted Answer · 2019-11-12T00:35:39.580

It would have helped if the code included a comment that x is the quotient to be used to process 8 bits at a time using a binary borrowless divide.

The crc divides the data + 16 pad bits of zero by poynomial, and the remainder is the CRC.

poly = x^16 + x^12 + x^5 + 1 = 0x11021 = 10001000000100001 (binary)

To process 8 bits at a time, each step divides the bits aaaabbbb0000000000000000 by 10001000000100001, where aaaabbbb is the upper 8 bits of the crc xor'ed with the next data byte. This can be done in a single divide step by noting the quotient = x = (aaaabbbb)^(aaaabbbb>>4) = aaaacccc, where c = a^b, so a^c = b, and a^b^c = 0:

                                    aaaacccc
                  --------------------------
10001000000100001 | aaaabbbb0000000000000000
                                    aaaacccc
                               aaaacccc
                        aaaacccc
                    aaaacccc
                            ----------------
                            cccbaaacbbbacccc

In the questions code, the bits above x^16 are not generated, since it is known they will cancel out bits x^16 through x^23. The left shift of 12 shifts off the upper 4 bits, corresponding to bits x^16 to x^19, and there is no left shift of 16.

example: data = 0x31 0x32 = 00110001 00110010

crc = 1111111111111111
x = (crc>>8)^00110001 = 11001110
q = (x)^(x>>4) = 11000010

                                   11000010
                  -------------------------
10001000000100001 |110011100000000000000000
                                   11000010
                              11000010
                       11000010
                   11000010
                           ----------------
                           0011100010000010

crc = (crc<<8)^0011100010000010 = 1100011110000010
x = (crc>>8) ^ 00110010 = 11110101
q = (x)^(x>>4) = 11111010

                                   11111010
                  -------------------------
10001000000100001 |111101010000000000000000
                                   11111010
                              11111010
                       11111010
                   11111010
                           ----------------
                           1011111110111010

crc = (crc<<8)^1011111110111010 = 0011110110111010 = 0x3dba

As commented, a table look up should be faster. If a cpu has a fast carryless multiply, such as X86 pclmulqdq, then a faster still assembly program can be used, but it's over 500 lines long (to "fold" 128 bytes at a time in parallel using xmm registers). The assembly code below doesn't document the constants (rk...); they are powers of 2 modulo the polynomial used for "folding", except rk7 = "(2^n)/polynomial", and rk8 = polynomial.

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_01.asm

I converted the code to Visual Studio ML64.EXE (MASM) and Windows, and have source code for crc16, crc32, crc64, non-reflected (non-reversed) and reflected).

score 2 · Answer 2 · answered Oct 12 '20 at 11:37

This is a bit late, I appreciate. I developed the original code this was based on back in 2002. I've since taken that web site down, but the Internet Archive has a copy. There are some details on why and how it works here:

https://web.archive.org/web/20160408050435/http://www.digitalnemesis.com/info/codesamples/embeddedcrc16/

This was originally a 4-bit table, and was developed for 8-bit PIC microcontrollers with only a few hundred bytes of RAM.

The link contains the original description, and the table generator program source as well as C and ASM implementations for the PIC.

Of course, this is terrible code if you're trying to do this on a modern PC. It was 'efficient' compared to the alternative, computing it bit but bit on the target device.

Understanting an efficient CRC-CCITT-16 implementation

2 Answers2