I am trying to compute 2^n using the IEEE 754 double representation. The trick is well known:

// Trick to calculate 2^n using the exponent field of the IEEE 754 double representation
#include <cstdint>

union ieee754{
    double d;
    uint32_t i[2];
    uint64_t ll;   // the same 64 bits viewed as one integer
};

// Reinterprets the bits of a 64-bit unsigned integer as a double
inline double uint642dp(uint64_t ll) {
  ieee754 tmp;
  tmp.ll=ll;
  return tmp.d;
}
-----------------------------------
// e^x = 2^n e^y, n = round(x/log(2)+0.5)

double x = 4.3; // start from there
double y = round(x/log(2)+0.5);
int n = (int)y; // convert the double to int
uint642dp(( ((uint64_t)n) +1023)<<52); // calculate 2^n quickly

If n = 4, it returns 16.
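
For reference, a minimal sketch of the same scalar trick written with `std::memcpy` instead of a union, since reading a union member other than the one last written is not guaranteed by the C++ standard (compilers such as GCC support it as an extension). The name `pow2_scalar` is hypothetical, and the sketch assumes n stays between -1022 and 1023 so that the result is a normal double.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of the original code.
// Builds 2^n by writing n + 1023 into the exponent field of a double.
// Assumes -1022 <= n <= 1023.
inline double pow2_scalar(int n)
{
    // Biased exponent moved to bits 52..62; sign and mantissa bits stay zero
    std::uint64_t bits = static_cast<std::uint64_t>(n + 1023) << 52;
    double d;
    std::memcpy(&d, &bits, sizeof d);  // copy the bit pattern, no value conversion
    return d;
}
```

With this sketch, `pow2_scalar(4)` gives 16.0 and `pow2_scalar(-1)` gives 0.5.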

Now I am looking for the same thing for SIMD. With SSE2 doubles, after the round function I have a register __m128d v = (4.0, 3.0); how do I calculate 2^v from this register? I am blocked mainly by the cast from __m128d to __m128i: a real conversion does not seem to exist (there is a cast, but it does not move any bits, it only changes the "interpretation" of the register from double to int).

I do not want to move the data from the SIMD register to general-purpose registers to do the transformation. There is certainly a SIMD trick for this, but I do not know it.

So Help ^_^'

Best,

Timocafé

1 Answer

The trick you are looking for is the _mm256_castsi256_pd intrinsic, which reinterprets the bits of a SIMD vector of integers as a SIMD vector of doubles. It exists only for C/C++ type checking and does not translate into any instruction.

Here is a snippet of code to perform this operation (only valid while the result is a normal double, i.e. for exponents from -1022 to 1023):

#include <immintrin.h>

__m256d pow2n (__m256i n)
{
    const __m256i bias = _mm256_set1_epi64x( 1023 );
    __m256i t = _mm256_add_epi64 (n, bias);
    t = _mm256_slli_epi64 (t, 52);
    return _mm256_castsi256_pd (t) ;
}

#include <cstdio>

int main ()
{
        __m256i rn = _mm256_set_epi64x( 7, 9, 4, 2 );
        __m256d pn = pow2n (rn) ;
        double v [4] ;
        _mm256_storeu_pd (v, pn) ;
        printf ("v = %lf %lf %lf %lf\n", v[0], v[1], v[2], v[3]) ;
        return 0 ;
}

pow2n compiles to only two instructions, as you can see on the Godbolt Compiler Explorer.

Output is :

v = 4.000000 16.000000 512.000000 128.000000

This requires AVX2, but a 128-bit version would only require SSE2. Note that the exponents are provided as 64-bit integers. If you have 32-bit integers, widen them first with _mm256_cvtepu32_epi64.
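
A minimal sketch of what such a 128-bit version could look like, using only SSE2 intrinsics. The name `pow2n_sse2` is hypothetical (it is not part of the code above), and the same exponent bounds are assumed: each n must lie between -1022 and 1023.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical 128-bit analogue of pow2n: 2^n for two 64-bit integer exponents.
// Assumes -1022 <= n <= 1023 in each lane.
static inline __m128d pow2n_sse2(__m128i n)
{
    const __m128i bias = _mm_set1_epi64x(1023);  // IEEE 754 double exponent bias
    __m128i t = _mm_add_epi64(n, bias);          // biased exponent
    t = _mm_slli_epi64(t, 52);                   // move it into the exponent field (bits 52..62)
    return _mm_castsi128_pd(t);                  // reinterpret the bits as doubles, no instruction emitted
}
```

For example, `pow2n_sse2(_mm_set_epi64x(3, 4))` holds 16.0 in the low lane and 8.0 in the high lane.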

If you're starting with a vector of double, like the OP, then use _mm256_cvtpd_epi32 followed by _mm256_cvtepu32_epi64. (A direct double -> int64 conversion isn't available until AVX-512.)
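
For the SSE2 case in the question, the same two steps can be sketched as follows. This is an illustration under stated assumptions, not the code from this answer: the name `pow2_from_pd_sse2` is hypothetical, the input doubles are assumed to already hold integer values between -1022 and 1023, and because `_mm_cvtepu32_epi64` is an SSE4.1 intrinsic, the zero-extension is done here with `_mm_unpacklo_epi32` against a zero vector instead.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: 2^v for two doubles that hold integer values,
// for example the output of a rounding step.
// Assumes -1022 <= v <= 1023 in each lane.
static inline __m128d pow2_from_pd_sse2(__m128d v)
{
    // double -> int32; the two results land in the low 64 bits of the register
    __m128i n32 = _mm_cvtpd_epi32(v);
    // Zero-extend each 32-bit value to 64 bits by interleaving with zeros
    // (SSE2 substitute for _mm_cvtepu32_epi64, which needs SSE4.1).
    __m128i n64 = _mm_unpacklo_epi32(n32, _mm_setzero_si128());
    // Zero-extension is fine even for negative exponents: the shift by 52
    // keeps only the low 12 bits of n + 1023.
    __m128i t = _mm_add_epi64(n64, _mm_set1_epi64x(1023));
    t = _mm_slli_epi64(t, 52);
    return _mm_castsi128_pd(t);  // reinterpret the bits as doubles
}
```

With the register from the question, `pow2_from_pd_sse2(_mm_set_pd(3.0, 4.0))` holds 16.0 in the low lane and 8.0 in the high lane, and the data never leaves the SIMD registers.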

Peter Cordes
Florent DUGUET
  • Nice answer. In C++11, you can just use `alignas(32)` instead of non-standard `__declspec` (GNU C vs. MSVC differ on this). Also, I might have used `_mm256_set_epi64x( 7, 9, 4, 2 )` for the initializer, instead of the `int64_t` array. Also, in English, it's `bias`, not `biais` :) – Peter Cordes May 01 '16 at 16:22
  • @PeterCordes, Thanks. I did not know the alignas(32) from C++11, thanks for mentioning, feel free to edit the post accordingly. For the `_mm256_set_epi64x`, I was being lazy using it, remembering that some compiler headers don't expose it with exact same naming. As for English, you got me ! (just edited the post). – Florent DUGUET May 01 '16 at 16:29
  • I had to check to see if it was `epi64x` or `epi64`. Compilers/linkers do a good job of merging vector constants, just like how they merge identical string literals into a single definition, so it's almost always best to use `_mm_set1` or `_mm_set` to assign constants to local variables, rather than putting them in static const storage yourself. (Unfortunately `static const __m128 foo = _mm_set()` sucks: that will run a constructor at startup instead of having the data in `foo` to start with). – Peter Cordes May 01 '16 at 16:41
  • @PeterCordes, thanks a lot for your edit. My compiler version yields a broadcast, though, so care should be taken with the flags. – Florent DUGUET May 01 '16 at 16:59
  • Intel's intrinsics guide only lists [`_mm256_set1_epi64x`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_set1&expand=4618), so hopefully it's safe to use across all compilers. It works for gcc and clang. Loading a 32B constant with a `vpbroadcastq` is probably a good thing, since it's as cheap as a `vmovdqa` on Intel hardware. Any decent compiler will hoist the broadcast-load out of a loop after inlining, so it doesn't have to load it every time through the loop, whether it's with a broadcast or not. – Peter Cordes May 01 '16 at 17:01