If you don't know the size of the buffer, you can't do it without a loop. Even if you don't write the loop yourself, calling something like strlen will result in a loop. I'm counting recursion as a loop here too.
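To make the hidden loop concrete, here is a minimal sketch of what a strlen-style scan has to do. The name scan_length is hypothetical and only for illustration. Real library implementations are heavily optimized, but they still have to walk the bytes until they find the terminator.

```c
#include <stddef.h>

// Hypothetical stand-in for strlen: counts the bytes before the first 0 byte.
size_t scan_length(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') {  // unavoidable when the size is not known up front
        n++;
    }
    return n;
}
```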
How do you know which bytes to keep and which to set to zero? If these bytes are in known positions, you can use vector operations to zero out some of the bytes and not others. The following example zeros out only the even bytes over the first 64 bytes of rawData:
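#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

__m128i zeros = _mm_setzero_si128();
// A byte is written only where the mask byte has its high bit (0x80) set.
uint8_t mask[16] = {0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0};
// Unaligned load: a plain uint8_t array has no 16-byte alignment guarantee.
__m128i sse_mask = _mm_loadu_si128((const __m128i *)mask);
_mm_maskmoveu_si128(zeros, sse_mask, (char *)&rawData[0]);
_mm_maskmoveu_si128(zeros, sse_mask, (char *)&rawData[16]);
_mm_maskmoveu_si128(zeros, sse_mask, (char *)&rawData[32]);
_mm_maskmoveu_si128(zeros, sse_mask, (char *)&rawData[48]);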
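If the high bit of a byte in mask is 1, the corresponding byte of zeros is copied to rawData; otherwise that byte of rawData is left untouched. You can use a sequence of these masked copies to replace some bytes and not others. The resulting machine code uses SSE2 instructions and contains no loop, so this can be quite fast. Alignment is not required, since _mm_maskmoveu_si128 accepts unaligned addresses, but SSE operations generally run faster if rawData is 16-byte aligned. Keep in mind that the underlying MASKMOVDQU instruction is a non-temporal store whose cost varies considerably between CPUs, so measure before relying on it for speed.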
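If the length is known at run time rather than fixed at 64, the same masked store can be wrapped in a short loop. The following is a minimal sketch, not a tuned implementation. The function name zero_even_bytes and its parameters are assumptions made for illustration, and the scalar fallback is only there so the code still compiles on targets without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_maskmoveu_si128
#define HAVE_SSE2_MASKMOVE 1
#endif

// Hypothetical helper: zero every even-indexed byte of buf[0..len).
void zero_even_bytes(uint8_t *buf, size_t len) {
    size_t i = 0;
#if defined(HAVE_SSE2_MASKMOVE)
    static const uint8_t mask[16] = {0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0,
                                     0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0};
    const __m128i zeros = _mm_setzero_si128();
    const __m128i sse_mask = _mm_loadu_si128((const __m128i *)mask);
    // Full 16-byte chunks: a zero is written only where the mask's high bit is set.
    for (; i + 16 <= len; i += 16) {
        _mm_maskmoveu_si128(zeros, sse_mask, (char *)&buf[i]);
    }
    // The masked stores are non-temporal; fence so later stores are ordered after them.
    _mm_sfence();
#endif
    // Scalar tail, and the whole job when SSE2 is unavailable.
    // i is 0 or a multiple of 16 here, so it is always an even index.
    for (; i < len; i += 2) {
        buf[i] = 0;
    }
}
```

Calling zero_even_bytes(rawData, 64) should have the same effect as the four explicit masked copies above, assuming rawData is a byte buffer.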
Sorry if you're targeting ARM. I believe the NEON intrinsics are similar, but not identical.
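For ARM, one hedged sketch is below. It does not use a masked store. Instead it loads 16 bytes, ANDs them with a keep mask, and stores the result back, which is a different technique from the SSE example above. The function name zero_even_bytes_block16 is hypothetical, and the scalar branch exists only so the code compiles on non-NEON targets.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: zero the even-indexed bytes of one 16-byte block at p.
void zero_even_bytes_block16(uint8_t *p) {
#if defined(__ARM_NEON)
    // 0x00 clears the byte at that position, 0xFF keeps it.
    static const uint8_t keep[16] = {0, 0xFF, 0, 0xFF, 0, 0xFF, 0, 0xFF,
                                     0, 0xFF, 0, 0xFF, 0, 0xFF, 0, 0xFF};
    uint8x16_t data = vld1q_u8(p);
    uint8x16_t keep_mask = vld1q_u8(keep);
    vst1q_u8(p, vandq_u8(data, keep_mask));  // rewrites all 16 bytes
#else
    // Fallback for targets without NEON.
    for (size_t i = 0; i < 16; i += 2) {
        p[i] = 0;
    }
#endif
}
```

Unlike the masked store, this version reads and rewrites every byte in the block, including the ones it keeps. That matters if another thread may modify those bytes at the same time.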