CRC Calculation for 256 bit chunks

Question

I'm using 256-bit variables (the __m256i type) in the new AVX2 version of my program, which is written with Intel intrinsics. Previously the data was processed in 64-bit chunks, so the _mm_crc32_u64 function was used for the CRC calculation:

crc = _mm_crc32_u64(seed,*chunk_64bit);

Now, to improve performance, I want to calculate the CRC of each 256-bit chunk (or at least each 128-bit chunk) separately. One way would be to apply _mm_crc32_u64 in a loop over the 64-bit values in each chunk, but I don't think that would help performance.
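For reference, the loop approach described above might look like the sketch below. The helper name crc32c_chunk256 is made up for illustration, and the code assumes GCC or Clang on x86-64 (the target attribute is compiler-specific) and that the 256-bit chunk is available in memory, for example as the address of an __m256i variable.

```c
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 */

/* Hypothetical helper: CRC-32C register update over one 256-bit (32-byte)
   chunk, done as four chained 64-bit crc32 steps. No pre- or
   post-inversion is applied; the caller picks the seed. */
__attribute__((target("sse4.2")))  /* GCC/Clang-specific assumption */
static uint32_t crc32c_chunk256(uint32_t seed, const void *chunk)
{
    uint64_t lane[4];
    uint64_t crc = seed;

    memcpy(lane, chunk, sizeof lane);       /* avoids aliasing problems */
    for (int i = 0; i < 4; i++)
        crc = _mm_crc32_u64(crc, lane[i]);  /* each step waits on the last */
    return (uint32_t)crc;
}
```

With an __m256i v, the call would be crc32c_chunk256(seed, &v). Every step depends on the result of the previous one, so the four instructions cannot overlap.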

What is the best method for calculating a CRC over a 256-bit (or 128-bit) chunk that is faster in total than the _mm_crc32_u64 approach?

Intel has the details [here](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf). "faster than _mm_crc32_u64" doesn't happen, but there's the naive way to use it (just chain it) and the fast way (see link, crc32 used in parallel with itself) — harold, Apr 12 '17 at 11:16

score 1 · Answer 1 · edited May 23 '17 at 12:02

You can interleave three crc32 instructions for higher performance. See this answer for code that does that. You can take it a step further by running that code on multiple processors and combining the resulting CRCs.
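As a rough illustration of the interleaving idea (not the code from the linked answer), here is a minimal sketch. It splits a buffer into three equal parts, runs three independent crc32 chains, and then combines them. All function names are made up for the example, the target attribute assumes GCC or Clang on x86-64, and the combine step is computed at run time with a plain polynomial multiply. A tuned implementation would more likely use precomputed tables or carry-less multiplication for that step.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 */

#define CRC32C_POLY 0x82F63B78u  /* CRC-32C (Castagnoli), bit-reflected */

/* a * b modulo the CRC polynomial; bit 31 is the x^0 term. a must be nonzero. */
static uint32_t multmodp(uint32_t a, uint32_t b)
{
    uint32_t m = UINT32_C(1) << 31, p = 0;
    for (;;) {
        if (a & m) {
            p ^= b;
            if ((a & (m - 1)) == 0)
                break;
        }
        m >>= 1;
        b = (b & 1) ? (b >> 1) ^ CRC32C_POLY : b >> 1;
    }
    return p;
}

/* Advance a raw CRC register as if nbytes zero bytes had been fed in,
   i.e. multiply it by x^(8*nbytes) using square-and-multiply. */
static uint32_t crc32c_shift(uint32_t crc, size_t nbytes)
{
    uint32_t power = 0x80000000u;  /* x^0 */
    uint32_t sq = 0x00800000u;     /* x^8 */
    for (; nbytes; nbytes >>= 1) {
        if (nbytes & 1)
            power = multmodp(sq, power);
        sq = multmodp(sq, sq);
    }
    return multmodp(power, crc);
}

/* Plain chained version over nwords 64-bit words, for comparison. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_serial(uint32_t crc, const unsigned char *p, size_t nwords)
{
    uint64_t c = crc, w;
    for (size_t i = 0; i < nwords; i++) {
        memcpy(&w, p + 8 * i, 8);
        c = _mm_crc32_u64(c, w);
    }
    return (uint32_t)c;
}

/* Interleaved version: p holds 3*n 64-bit words, split into parts A, B, C. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_3way(uint32_t crc, const unsigned char *p, size_t n)
{
    const unsigned char *pa = p, *pb = p + 8 * n, *pc = p + 16 * n;
    uint64_t a = crc, b = 0, c = 0, wa, wb, wc;

    for (size_t i = 0; i < n; i++) {
        memcpy(&wa, pa + 8 * i, 8);
        memcpy(&wb, pb + 8 * i, 8);
        memcpy(&wc, pc + 8 * i, 8);
        a = _mm_crc32_u64(a, wa);  /* three independent dependency chains */
        b = _mm_crc32_u64(b, wb);
        c = _mm_crc32_u64(c, wc);
    }
    /* A is followed by 16*n bytes, B by 8*n bytes, C by nothing. */
    return crc32c_shift((uint32_t)a, 16 * n)
         ^ crc32c_shift((uint32_t)b, 8 * n)
         ^ (uint32_t)c;
}
```

crc32c_3way(seed, buf, n) should return the same value as crc32c_serial(seed, buf, 3 * n). The same shift-and-XOR combine is what would let CRCs computed on separate processors be merged. Because the combine here is not free, interleaving only pays off when each part is reasonably long.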

answered Apr 12 '17 at 23:32

Mark Adler
