5

I have a boolean expression that I have managed to implement in SSE2. Now I would like to try implementing it in AVX, exploiting a further factor-of-2 increase in parallelism (from 128-bit to 256-bit SIMD types). However, AVX does not support integer operations (AVX2 does, but I am working on a Sandy Bridge processor, so it is not an option currently). Since there are AVX intrinsics for bitwise operations, I figured I could give it a try by just converting my integer types to float types and seeing if it works.

First test was a success:

__m256 ones = _mm256_set_ps(1,1,1,1,1,1,1,1);
__m256 twos = _mm256_set_ps(2,2,2,2,2,2,2,2); 
__m256 result = _mm256_and_ps(ones, twos);

I'm getting all 0's, as I am supposed to. Similarly, AND'ing the twos instead, I get a result of 2. But when trying 11 XOR 4 accordingly:

__m256 elevens = _mm256_set_ps(11,11,11,11,11,11,11,11); 
__m256 fours = _mm256_set_ps(4,4,4,4,4,4,4,4); 
__m256 result2 = _mm256_xor_ps(elevens, fours); 

The result is 6.46e-46 (i.e. close to 0) and not 15. Similarly, doing 11 OR 4 gives me a value of 22, not 15 as it should be. I don't understand why this is. Is it a bug or some configuration I am missing?

I was actually expecting my hypothesis of working with floats as if they were integers not to work, since an integer converted to a float might not be stored as the precise value but only as a close approximation. But even then, I am surprised by the result I get.

Does anyone have a solution to this problem, or must I upgrade my CPU to get AVX2 support to enable this?

Toby999
  • It sounds like you're printing an integer as a float to get 6.46e-46. Are you sure your `printf()` formatting specifiers are correct? – 1'' Dec 11 '13 at 19:12
  • I was not printing. I just checked the value in the Visual Studio debugger. – Toby999 Dec 11 '13 at 19:21

2 Answers

7

The first test worked by accident.

1 as a float is 0x3f800000, 2 is 0x40000000. In general, it wouldn't work that way.

But you can absolutely do it; you just have to make sure that you're working with the right bit patterns. Don't convert your integers to floats; reinterpret-cast them. That corresponds to intrinsics such as `_mm256_castsi256_ps`, or to storing your ints to memory and reading them back as floats. That won't change the bits: in general, only math operations care about what the floats mean, while the rest work with the raw bit patterns (check the list of exceptions an instruction can raise to make sure).

harold
  • Aha. Thanks. That makes sense. I'll give it a try and mark your answer as correct if it works. – Toby999 Dec 11 '13 at 19:23
  • 2
    @Toby999 But be aware that on all current Intel processors, the floating-point versions of the bitwise logic instructions have only 1/3 the throughput of the integer versions. So if you're doing this for performance, you might want to think twice. It might even backfire unless you're limited by decoder bandwidth. – Mysticial Dec 11 '13 at 19:33
  • 2
    On Sandy and Ivy Bridge, integer SSE bitwise logic can go to any of ports 0, 1, or 5 at one/cycle. That's 3 per cycle. But floating-point SSE bitwise logic can only go to port 5 at one/cycle. So it's limited to 1 per cycle. On Haswell, it's the same, but it has AVX2 - which makes the point moot. – Mysticial Dec 11 '13 at 19:36
  • You can use the AVX integer load and store operations (e.g. `_mm256_loadu_si256`) with AVX; you just can't do the integer arithmetic operations (e.g. `_mm256_add_epi32`). So you should be able to load the integers and then use `_mm256_and_ps`. – Z boson Dec 11 '13 at 19:37
  • Thanks for the extra input. After having implemented an initial version of the full math expression, the AVX version does indeed have less throughput than the SSE2 version. I guess that's due to your explanation, Mysticial. I wasn't expecting much extra anyway, since I think I am quite close to the maximum memory read bandwidth. Disappointing though. ;( – Toby999 Dec 12 '13 at 17:30
5

You don't need AVX2 to use the AVX integer load and store operations: see the Intel intrinsics guide. So you can load your integers using AVX, reinterpret-cast to float, use float bitwise operations, and then reinterpret-cast back to int. The reinterpret-casts don't generate any instructions; they just make the compiler happy. Try this:

//compiled and ran on an Ivy Bridge system with AVX but without AVX2
#include <stdio.h>
#include <immintrin.h>
int main() {
    int a[8] = {0, 2, 4, 6, 8, 10, 12, 14};
    int b[8] = {1, 1, 1, 1, 1,  1,  1,  1};
    int c[8];

    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);
    __m256i c8 = _mm256_castps_si256(
        _mm256_or_ps(_mm256_castsi256_ps(a8), _mm256_castsi256_ps(b8)));
    _mm256_storeu_si256((__m256i*)c, c8);
    for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n");
    //output: 1 3 5 7 9 11 13 15
}

Of course, as Mysticial pointed out, this might not be worth doing, but that does not mean you can't do it.

Z boson
  • Thanks for your input. It was helpful, since it is time-consuming digging out the correct intrinsics. – Toby999 Dec 12 '13 at 17:31
  • There are options for aligning variables, so you don't need to deal with unaligned loads. – phuclv Sep 18 '14 at 09:18
  • @LưuVĩnhPhúc, I was working with the assumption that it does not matter any more. The throughput and latency of the aligned and unaligned load/store instructions are the same on aligned memory. That's the theory, but in practice I'm still seeing a difference, so I agree with you that the aligned load instructions should be used. – Z boson Sep 18 '14 at 09:21