Computing x^y with GCC vector intrinsics

Question

Suppose I have a 2-element vector defined as follows (using the GCC syntax for packed vectors):

// packed vector of 2 elements
typedef double v2d __attribute__((vector_size(sizeof(double)*2)));

v2d x = ...;
double y = ...;

x[0] = pow(x[0], y);
x[1] = pow(x[1], y);

I'd like to know if there's a faster way to do the two power computations using vector operations. The compiler is GCC targeting x86-64, and platform-specific code is OK.

Implementing a generic power function is difficult as it is since you may need both `exp()` and `log()`. There may likely be too much branching to be able to get a worthwhile speedup via vectorizing. But I'm just speculating though. — Mysticial, Nov 16 '12 at 21:44
No, the SIMD instruction set doesn't have any operations that allow speeding up pow(). SSE2 only has add, sub, mul, div, max, min and sqrt. There's not even a non-vectorized instruction for it. — Hans Passant, Nov 16 '12 at 22:04
There may be some hope if "y" is limited to unsigned int instead of double. Indeed, with the classic "shift-and-multiply" algorithm the two vector's elements could be evaluated in parallel. Just my guess. — Giuseppe Guerrini, Nov 16 '12 at 23:16
[SSE vectorization of math `pow` function gcc](https://stackoverflow.com/q/6918141/995714), [`pow` for SSE types](https://stackoverflow.com/q/25936031/995714) — phuclv, Jan 18 '18 at 16:11
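
The shift-and-multiply idea from the comment above can be sketched as follows. This is a minimal sketch that assumes the exponent is a non-negative integer; the name v2d_powu is hypothetical and does not come from the original thread. Because the exponent is a scalar shared by both lanes, the branch on its bits is identical for both elements, so each step is a single packed multiply.

```c
typedef double v2d __attribute__((vector_size(sizeof(double) * 2)));

/* Hypothetical helper (not from the thread): raise both lanes of x to the
   unsigned integer power n using square-and-multiply. */
static v2d v2d_powu(v2d x, unsigned n)
{
    v2d result = {1.0, 1.0};        /* x^0 == 1 in both lanes */
    while (n != 0) {
        if (n & 1u)
            result *= x;            /* one packed multiply covers both lanes */
        x *= x;                     /* square the base for the next bit */
        n >>= 1;
    }
    return result;
}
```

The loop runs once per bit of n, so the cost grows with the logarithm of the exponent rather than with the exponent itself. It does not help when y is an arbitrary double.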

score 5 · Answer 1 · answered Nov 17 '12 at 00:32


Yes, this should be possible if you have no special cases (negative numbers, 0, 1, NaN etc...) so that the code path is linear.

Here is the generic code for the pow function for IEEE 754 doubles. It has no looping constructs, so if you flesh out all the special cases, vectorization seems straightforward. Have fun.

Gunther Piez

Haha... I see what you mean by "have fun". :) – Mysticial Nov 17 '12 at 03:19

Z boson · Answer 2 · 2018-01-18T13:21:03.400

You can loop over the elements directly, and with the right options GCC and ICC will use a vectorized pow function:

#include <math.h>
typedef double vnd __attribute__((vector_size(sizeof(double)*2)));

vnd foo(vnd x, vnd y) {
    #pragma omp simd
    for(int i=0; i<2; i++) x[i] = pow(x[i], y[i]); 
    return x;
}

With just -O2, ICC simply generates a call to __svml_pow2. SVML (Short Vector Math Library) is Intel's vectorized math library. With -Ofast -fopenmp, GCC simply generates a call to _ZGVbN2vv___pow_finite.

Clang does not vectorize it.

https://godbolt.org/g/pjpzFX
