Conditionally flip sign of float with SSE and/or AVX

Question

bitchar[] is an array of 0s and 1s. I want to flip the sign of in[i] if bitchar[i] == 1 (scrambling):

    float *in = get_in();
    float *out = get_out();
    char *bitchar = get_bitseq();
    for (int i = 0; i < size; i++) {
       out[i] = in[i] * (1 - 2 * bitchar[i]);
    }

My AVX code:

    __m256 xmm1 = _mm256_set1_ps(1);
    __m256 xmm2 = _mm256_set1_ps(2);
    for (int i = 0; i < size; i+=8) {
       // _mm256_set_ps takes its arguments from the highest element down to the lowest
       __m256 xmmb = _mm256_set_ps (bitchar[i+7], bitchar[i+6], bitchar[i+5], bitchar[i+4], bitchar[i+3], bitchar[i+2], bitchar[i+1], bitchar[i]);
       __m256 xmmpm1 = _mm256_sub_ps(xmm1, _mm256_mul_ps(xmm2,xmmb));
       __m256 xmmout = _mm256_mul_ps(_mm256_load_ps(&in[i]),xmmpm1);
       _mm256_store_ps(&out[i],xmmout);
    }

However, the AVX code is not much faster than the scalar loop, and is sometimes even slower. Maybe my AVX code is not optimal. Could anyone help me?

To toggle the sign you just need to XOR the value with 0x80000000 using `_mm256_xor_ps` instead of lots of instructions like that – phuclv, Nov 21 '20 at 14:57
@phuclv Thanks, I do agree. However, the XOR trick depends on how the sign bit is implemented in a specific architecture. I will use it as the last resort. – Anna Noie, Nov 21 '20 at 15:03
@AnnaNoie AVX always uses IEEE-754, so there's no better way. Since you're already using intrinsics, there's zero reason to write portable, format-independent code – phuclv, Nov 21 '20 at 15:05
@phuclv OK. Could you suggest an optimal way to load an array of 0/1 stored in `char` to +0.0 and -0.0? – Anna Noie, Nov 21 '20 at 15:19
Do you have AVX2 or just AVX? With AVX2 this should be doable using `_mm256_cvtepu8_epi32` and a shift. – chtz, Nov 21 '20 at 15:35
@chtz I did it. However, AVX does not provide 256-bit integer shift intrinsics; only AVX2 does. I used SSE4.1 and it works OK. Thanks. – Anna Noie, Nov 21 '20 at 17:04
If you're storing whole bytes, can you make them 0 / -1 so you can sign-extend to 32-bit with `_mm256_cvtepi8_epi32`, and AND instead of shift? That would require a separate vector constant, and `vpand` is only better than `vpslld ymm, ymm, 31` if you're doing this in a loop mixed with FP math operations that would compete for ports with the shift. – Peter Cordes, Nov 21 '20 at 22:03
@PeterCordes You are right. Unfortunately I cannot change the way the scrambling code (the `bitchar[]`) is stored. – Anna Noie, Nov 21 '20 at 22:28
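
A minimal scalar sketch of the XOR idea from the comments above, assuming `float` is IEEE-754 binary32 (which is the case on any machine that has SSE/AVX). The helper name `flip_sign_if` is made up for illustration. It shifts the 0/1 byte into bit 31 and XORs it into the float's bit pattern:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: flip the sign of x when bit == 1, leave x alone when bit == 0.
   Assumes float is IEEE-754 binary32, so bit 31 is the sign bit. */
static float flip_sign_if(float x, unsigned char bit)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);      /* read the bit pattern without aliasing problems */
    u ^= (uint32_t)bit << 31;      /* bit 0 -> XOR with 0, bit 1 -> XOR with 0x80000000 */
    memcpy(&x, &u, sizeof x);      /* write the modified pattern back */
    return x;
}
```

For finite inputs this gives the same result as `in[i] * (1 - 2 * bitchar[i])`. The SIMD versions do the same shift and XOR on 4 or 8 lanes at once.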

score 0 · Answer 1 · answered Nov 21 '20 at 17:17

Thanks, everyone, for the hints. I came up with this solution using SSE4.1. Any better solution would be appreciated.

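    const int size4 = (size / 4) * 4;
    for (int i = 0; i < size4; i += 4) {
        // Loads 16 bytes, but only the low 4 are used: zero-extend 4 chars to 4 x int32.
        __m128i xmm1 = _mm_cvtepu8_epi32(_mm_castps_si128(_mm_loadu_ps((float *) &bitchar[i])));
        // Move bit 0 into the sign-bit position: 0 -> +0.0f, 1 -> -0.0f.
        __m128 xmm2 = _mm_castsi128_ps(_mm_slli_epi32(xmm1, 31));
        // XOR flips the sign of in[i] wherever bitchar[i] == 1.
        __m128 xmm3 = _mm_xor_ps(xmm2, _mm_loadu_ps(&in[i]));
        _mm_storeu_ps(&out[i], xmm3);
    }
    // Scalar tail for the remaining size % 4 elements.
    for (int i = size4; i < size; i++) {
        out[i] = in[i] * (1 - 2 * bitchar[i]);
    }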

answered Nov 21 '20 at 17:17

Anna Noie


Normally you'd use `_mm_loadu_si128((const __m128i *) &bitchar[i])`. (Or actually an intrinsic to load 4 or 8 bytes into a `__m128i`, not 16 bytes, but compilers often suck at folding such loads into a memory source for `vpmovzxbd`.) And you could use `_mm256_cvtepu8_epi32` to unpack to a 256-bit vector, if you have AVX2 as well as AVX. – Peter Cordes Nov 21 '20 at 22:07
@PeterCordes thanks. I will switch to `_mm_loadu_si128`. And my target machine does not have AVX2. – Anna Noie Nov 21 '20 at 22:29
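
Since the target machine has AVX but not AVX2, one possible way to still handle 8 floats per iteration is to build the two 128-bit halves of the sign mask with SSE4.1 integer instructions and combine them for a single `_mm256_xor_ps`. The following is a sketch under a few assumptions, not a tuned implementation: the names `sign_masks4`, `flip_signs_avx` and `flip_signs` are hypothetical, the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions (used here so no `-mavx` flag is needed), and whether it actually beats the SSE4.1 loop has to be measured.

```c
#include <immintrin.h>  /* SSE4.1 and AVX intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: turn 4 bytes of 0/1 into 4 lanes of 0x00000000 / 0x80000000. */
__attribute__((target("avx")))
static inline __m128i sign_masks4(const char *bits)
{
    int32_t packed;
    memcpy(&packed, bits, sizeof packed);             /* load exactly 4 bytes, no over-read */
    __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(packed)); /* zero-extend to 4 x int32 */
    return _mm_slli_epi32(v, 31);                     /* move bit 0 to the sign position */
}

/* AVX (no AVX2) path: 8 floats per iteration, scalar tail for the rest. */
__attribute__((target("avx")))
static void flip_signs_avx(const float *in, float *out, const char *bits, int size)
{
    int i = 0;
    for (; i + 8 <= size; i += 8) {
        __m128 lo = _mm_castsi128_ps(sign_masks4(bits + i));
        __m128 hi = _mm_castsi128_ps(sign_masks4(bits + i + 4));
        /* Combine the two 128-bit halves into one 256-bit mask. */
        __m256 mask = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
        _mm256_storeu_ps(out + i, _mm256_xor_ps(_mm256_loadu_ps(in + i), mask));
    }
    for (; i < size; i++)
        out[i] = bits[i] ? -in[i] : in[i];
}

/* Hypothetical entry point: use the AVX path when the CPU supports it. */
void flip_signs(const float *in, float *out, const char *bits, int size)
{
    if (__builtin_cpu_supports("avx")) {
        flip_signs_avx(in, out, bits, size);
        return;
    }
    for (int i = 0; i < size; i++)                    /* portable fallback */
        out[i] = bits[i] ? -in[i] : in[i];
}
```

A call such as `flip_signs(in, out, bitchar, size)` would replace both loops in the answer. The 4-byte `memcpy` load also avoids reading past the end of `bitchar[]`.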
