Given `bitchar[]`, an array of 0s and 1s, I want to flip the sign of `in[i]` whenever `bitchar[i]` is 1 (scrambling):

    float *in = get_in();
    float *out = get_out();
    char *bitchar = get_bitseq();
    for (int i = 0; i < size; i++) {
       out[i] = in[i] * (1 - 2 * bitchar[i]);
    }

My AVX code:

    __m256 xmm1 = _mm256_set1_ps(1);   // broadcast 1.0f to all 8 lanes
    __m256 xmm2 = _mm256_set1_ps(2);   // broadcast 2.0f to all 8 lanes
    for (int i = 0; i < size; i+=8) {
       // _mm256_setr_ps takes its arguments in memory order (lowest lane first)
       __m256 xmmb = _mm256_setr_ps (bitchar[i], bitchar[i+1], bitchar[i+2], bitchar[i+3], bitchar[i+4], bitchar[i+5], bitchar[i+6], bitchar[i+7]);
       __m256 xmmpm1 = _mm256_sub_ps(xmm1, _mm256_mul_ps(xmm2,xmmb));
       __m256 xmmout = _mm256_mul_ps(_mm256_load_ps(&in[i]),xmmpm1);
       _mm256_store_ps(&out[i],xmmout);
    }

However, the AVX code is not much faster, and is sometimes even slower. Maybe my AVX code is not optimal. Could anyone help me?

Anna Noie
  • To toggle the sign you just need to XOR the value with 0x80000000 using `_mm256_xor_ps` instead of lots of instructions like that – phuclv Nov 21 '20 at 14:57
  • @AdrianMole yes, it is `in[i]` – Anna Noie Nov 21 '20 at 15:01
  • @phuclv thanks, I do agree. However, the xor trick depends on how the sign bit is implemented in a specific architecture. I will use it as the last resort. – Anna Noie Nov 21 '20 at 15:03
  • @AnnaNoie AVX always uses IEEE-754 so there's no other better way. Since you're already using intrinsics there's zero reason to write portable code that is format-independent – phuclv Nov 21 '20 at 15:05
  • @phuclv ok. Could you suggest an optimal way to load an array of 0/1 stored in `char` to +0.0 and -0.0? – Anna Noie Nov 21 '20 at 15:19
  • Do you have AVX2 or just AVX? With AVX this should be doable using `_mm256_cvtepu8_epi32` and a shift. – chtz Nov 21 '20 at 15:35
  • @chtz: ITYM "with *AVX2* this should be doable..." – Paul R Nov 21 '20 at 16:56
  • @chtz I did it. However, AVX does not provide shift intrinsics. Only AVX2. I used SSE4.1 and it works OK. Thanks. – Anna Noie Nov 21 '20 at 17:04
  • (Sorry, mistyped. Obviously, I meant AVX2 ...) – chtz Nov 21 '20 at 17:21
  • If you're storing whole bytes, can you make them 0 / -1 so you can sign-extend to 32-bit with `_mm256_cvtepi8_epi32`, and AND instead of shift? That would require a separate vector constant, and `vpand` is only better than `vpslld ymm, ymm, 31` if you're doing this in a loop mixed with FP math operations that would compete for ports with the shift. – Peter Cordes Nov 21 '20 at 22:03
  • @PeterCordes You are right. Unfortunately I cannot change the way scrambling code (the `bitchar[]`) is stored. – Anna Noie Nov 21 '20 at 22:28
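
As an illustration of the XOR idea raised in the comments above, here is a minimal scalar sketch. It assumes `float` is IEEE-754 binary32 (true on any AVX-capable target) and that `bitchar[]` holds only 0 or 1. The names `flip_sign_if` and `scramble_scalar` are hypothetical and do not come from the post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is IEEE-754 binary32, so bit 31 is the sign bit. */
static float flip_sign_if(float x, char bit)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);          /* reinterpret the float's bits safely */
    u ^= (uint32_t)(bit & 1) << 31;    /* flip the sign bit only when bit == 1 */
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Hypothetical helper: same result as out[i] = in[i] * (1 - 2 * bitchar[i]). */
void scramble_scalar(float *out, const float *in, const char *bitchar, size_t size)
{
    for (size_t i = 0; i < size; i++)
        out[i] = flip_sign_if(in[i], bitchar[i]);
}
```

The vector versions discussed in this thread do the same thing several lanes at a time: build a mask of `0x80000000` or `0` per element, then XOR it with the input.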

1 Answer

Thanks, everyone, for the hints. I came up with this solution using SSE4.1. Any better solution would be appreciated.

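    const int size4 = (size / 4) * 4;
    for (int i = 0; i < size4; i += 4) {
        // Zero-extend the low 4 bytes to four 32-bit lanes (the load reads 16 bytes, only 4 are used).
        __m128i xmm1 = _mm_cvtepu8_epi32(_mm_castps_si128(_mm_loadu_ps((float *) &bitchar[i])));
        // Move the 0/1 flag into bit 31, the IEEE-754 sign bit.
        __m128 xmm2 = _mm_castsi128_ps(_mm_slli_epi32(xmm1, 31));
        // XOR flips the sign of in[i] wherever the flag is set.
        __m128 xmm3 = _mm_xor_ps(xmm2, _mm_loadu_ps(&in[i]));
        _mm_storeu_ps(&out[i], xmm3);
    }
    // Scalar tail for the remaining size % 4 elements.
    for (int i = size4; i < size; i++) {
        out[i] = in[i] * (1 - 2 * bitchar[i]);
    }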
Anna Noie
  • Normally you'd use `_mm_loadu_si128((const __m128i *) &bitchar[i])`. (Or actually an intrinsic to load 4 or 8 bytes into a `__m128i`, not 16 bytes, but compilers often suck at folding such loads into a memory source for `vpmovzxbd`.) And you could use `_mm256_cvtepu8_epi32` to unpack to a 256-bit vector, if you have AVX2 as well as AVX. – Peter Cordes Nov 21 '20 at 22:07
  • @PeterCordes thanks. I will switch to `_mm_loadu_si128`. And my target machine does not have AVX2. – Anna Noie Nov 21 '20 at 22:29
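
Following the comments above about loading only the bytes that are needed, here is a hedged sketch of an 8-wide variant for a machine that has AVX but not AVX2. It is not the author's code. It assumes `bitchar[]` holds only 0 or 1 and that the file is compiled with AVX enabled (for example `-mavx` on GCC or Clang); without that, the `__AVX__` guard leaves only the scalar loop. The names `scramble_avx` and `flip_sign_scalar` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Scalar fallback and tail handling: flip bit 31 when bit == 1. */
static float flip_sign_scalar(float x, char bit)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u ^= (uint32_t)(bit & 1) << 31;
    memcpy(&x, &u, sizeof x);
    return x;
}

void scramble_avx(float *out, const float *in, const char *bitchar, size_t size)
{
    size_t i = 0;
#ifdef __AVX__
    for (; i + 8 <= size; i += 8) {
        /* Load exactly 8 flag bytes, so nothing past bitchar[i+7] is read. */
        __m128i bytes = _mm_loadl_epi64((const __m128i *)&bitchar[i]);
        /* Zero-extend bytes 0..3 and 4..7 to 32-bit lanes (SSE4.1). */
        __m128i lo = _mm_cvtepu8_epi32(bytes);
        __m128i hi = _mm_cvtepu8_epi32(_mm_srli_si128(bytes, 4));
        /* Move each 0/1 flag into the sign-bit position. */
        lo = _mm_slli_epi32(lo, 31);
        hi = _mm_slli_epi32(hi, 31);
        /* Combine the two 128-bit halves into one 256-bit sign mask. */
        __m256 mask = _mm256_insertf128_ps(
            _mm256_castps128_ps256(_mm_castsi128_ps(lo)),
            _mm_castsi128_ps(hi), 1);
        _mm256_storeu_ps(&out[i], _mm256_xor_ps(_mm256_loadu_ps(&in[i]), mask));
    }
#endif
    /* Remaining elements (or everything, when AVX is not enabled). */
    for (; i < size; i++)
        out[i] = flip_sign_scalar(in[i], bitchar[i]);
}
```

The integer widening and shifting stay in 128-bit registers because 256-bit integer operations require AVX2. Only the final XOR, load, and store use the 256-bit floating-point instructions that plain AVX provides. Whether this beats the 4-wide SSE4.1 loop would need to be measured on the target machine.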