
I'm using 256-bit variables (__m256i type) with AVX2 and Intel intrinsics in the new version of my program. Previously, the data was processed in 64-bit chunks, so the _mm_crc32_u64 function was used for the CRC calculation:

crc = _mm_crc32_u64(seed,*chunk_64bit);

But now, in order to improve performance, I want to calculate the CRC of each 256-bit chunk (or at least each 128-bit chunk) separately. One way would be to apply _mm_crc32_u64 in a loop over the 64-bit values in each chunk, but I don't think that is beneficial in terms of performance.
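The loop-based approach described above can be sketched roughly as follows. This is only a minimal illustration: the helper name crc32c_chunk256 is hypothetical, the chunk is assumed to be readable from memory as 32 bytes, and the target attribute assumes GCC or Clang on x86-64 (compiling with -msse4.2 instead should work as well).

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: CRC-32C over one 256-bit (32-byte) chunk, computed by
   chaining _mm_crc32_u64 over its four 64-bit words in memory order.
   The target attribute is a GCC/Clang assumption. */
__attribute__((target("sse4.2")))
uint32_t crc32c_chunk256(uint32_t seed, const void *chunk)
{
    const unsigned char *p = (const unsigned char *)chunk;
    uint64_t crc = seed;
    for (int i = 0; i < 4; i++) {
        uint64_t word;
        memcpy(&word, p + 8 * i, sizeof word);  /* unaligned-safe 64-bit load */
        crc = _mm_crc32_u64(crc, word);         /* depends on the previous step */
    }
    return (uint32_t)crc;
}
```

With a __m256i variable v, the call would be crc32c_chunk256(seed, &v). Each step depends on the result of the previous one, so the four crc32 instructions run strictly one after another.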

What is the best method for calculating a CRC over a 256-bit (or 128-bit) chunk that is faster in total than using _mm_crc32_u64?

  • Intel has the details [here](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf). "faster than _mm_crc32_u64" doesn't happen, but there's the naive way to use it (just chain it) and the fast way (see link, crc32 used in parallel with itself) – harold Apr 12 '17 at 11:16

1 Answer


You can interleave three crc32 instructions for higher performance. See this answer for code that does that. You can take it a step further by running that code on multiple processors and combining the resulting CRCs.
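As a rough illustration of the interleaving idea only (not the linked code), the sketch below computes the CRCs of three separate 256-bit chunks in three independent dependency chains. Because the question asks for a separate CRC per chunk, no combining step is needed here; splitting one long buffer into three streams would additionally require combining the partial CRCs, which this sketch does not attempt. The function name crc32c_3x256 is hypothetical, and the target attribute assumes GCC or Clang on x86-64.

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: CRC-32C of three independent 32-byte chunks a, b, c.
   The three crc32 chains do not depend on each other, so the CPU can overlap
   them instead of waiting out the latency of each instruction. */
__attribute__((target("sse4.2")))
void crc32c_3x256(const void *a, const void *b, const void *c,
                  uint32_t seed, uint32_t out[3])
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    const unsigned char *pc = (const unsigned char *)c;
    uint64_t ca = seed, cb = seed, cc = seed;

    for (int i = 0; i < 4; i++) {
        uint64_t wa, wb, wc;
        memcpy(&wa, pa + 8 * i, sizeof wa);  /* unaligned-safe 64-bit loads */
        memcpy(&wb, pb + 8 * i, sizeof wb);
        memcpy(&wc, pc + 8 * i, sizeof wc);
        ca = _mm_crc32_u64(ca, wa);          /* chain 1 */
        cb = _mm_crc32_u64(cb, wb);          /* chain 2, independent of chain 1 */
        cc = _mm_crc32_u64(cc, wc);          /* chain 3, independent of both */
    }
    out[0] = (uint32_t)ca;
    out[1] = (uint32_t)cb;
    out[2] = (uint32_t)cc;
}
```

Three chains is the usual choice because the crc32 instruction is commonly documented with a latency of about three cycles and a throughput of one per cycle on Intel cores, so three independent chains can keep the unit busy. Whether this helps in a given program is worth measuring.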

Mark Adler