SSE vectorization of math 'pow' function gcc

Question

I was trying to vectorize a loop that uses the 'pow' function from the math library. I am aware that the Intel compiler supports 'pow' with SSE instructions, but I can't seem to get it to work with gcc (I think). This is the case I am working with:

#include <math.h>

int main(){
        int i=0;
        float a[256],
        b[256];

        float x= 2.3;

        for  (i =0 ; i<256; i++){
                a[i]=1.5;
        }

        for (i=0; i<256; i++){
                b[i]=pow(a[i],x);
        }

        for (i=0; i<256; i++){
                b[i]=a[i]*a[i];
        }
        return 0;
}

I'm compiling with the following:

gcc -O3 -Wall -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 code.c -o runthis

This is on OS X 10.5.8 using gcc version 4.2 (I used 4.5 as well and couldn't tell whether it had vectorized anything, as it didn't output anything at all). It appears that none of the loops vectorize. Is there an alignment issue, or some other issue for which I need to use restrict? If I write one of the loops as a function I get slightly more verbose output (code):

void pow2(float *a, float * b, int n) {
        int i;
        for (i=0; i<n; i++){
                b[i]=a[i]*a[i];
        }
}

output (using level 7 verbose output):

note: not vectorized: can't determine dependence between *D.2878_13 and *D.2877_8
bad data dependence.

I looked at the gcc auto-vectorization page but that didn't help too much. If it is not possible to use pow in the gcc version, where could I find a resource for a pow-equivalent function (I'm mostly dealing with integer powers)?

Edit: I was just digging into some other source. How did it vectorize this?!

void array_op(double * d,int len,double value,void (*f)(double*,double*) ) { 
    for ( int i = 0; i < len; i++ ){
        f(&d[i],&value);
    }
};

The relevant gcc output:

note: Profitability threshold is 3 loop iterations.

note: LOOP VECTORIZED.

Well now I'm at a loss -- 'd' and 'value' are modified by a function that gcc doesn't know about - strange? Maybe I need to test this portion a little more thoroughly to make sure the results are correct for the vectorized portion. Still looking for a vectorized math library - why aren't there any open source ones?

Optimizing your `main` to a `return 0` is normal: nothing outside of `main` can observe the result, so optimizing the loops away entirely doesn't change anything about the program behaviour. The arrays were local variables with automatic storage, so there are no side effects like calls to malloc/free for the compiler to preserve, either. – Peter Cordes, Jan 19 '18 at 01:42

Damon · Answer 1 · 2011-08-02T20:49:03.653

5

Using __restrict or consuming inputs (assigning to local vars) before writing outputs should help.

As it is now, the compiler cannot vectorize because a might alias b, so doing 4 multiplies in parallel and writing back 4 values might not be correct.

(Note that __restrict won't guarantee that the compiler vectorizes, but as the code stands now, it certainly cannot.)
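
As a rough illustration of the `__restrict` suggestion, here is a minimal sketch of the question's squaring loop with both pointers qualified. It uses the C99 spelling `restrict` (`__restrict` is the compiler-extension spelling of the same promise). The name `square_array` and the `assert` are assumptions added for this example and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name. `restrict` promises the compiler that src and dst
 * never refer to overlapping memory for the duration of the call. */
void square_array(const float *restrict src, float *restrict dst, size_t n)
{
    /* Debug-only sanity check: catches the trivial case of passing the
     * same array twice; it cannot detect partial overlap. */
    assert(n == 0 || src != dst);

    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * src[i];
    }
}
```

Calling it as `square_array(a, b, 256)` with two distinct arrays keeps the promise. Passing overlapping arrays breaks it, and the behaviour is then undefined. With GCC 4.x this sketch needs `-std=c99` (or the `__restrict` spelling and a loop counter declared outside the `for`).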

edited Aug 02 '11 at 20:49

answered Aug 02 '11 at 20:37


Even if this doesn't solve this problem it is worth taking the time to understand what is recommended here and *why*, because until you do you will never be able to guess what the compiler can and cannot optimize. – dmckee --- ex-moderator kitten Aug 02 '11 at 21:45
This certainly works for the simple b=a*a, but what about the pow function? Could you point me to a resource. I don't want (/have the time) to write my own - but if I have to that wouldn't be the worst thing. – Marm0t Aug 02 '11 at 23:41
Something like [this](http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/)? – Damon Aug 03 '11 at 15:55
@Damon Wouldn't disabling strict aliasing rules cause problems not only for vectorization but almost certainly for other optimizations too? Though I assume one could change the code accordingly for some small performance hit. – Voo Aug 03 '11 at 22:23
@Voo: Disabling strict aliasing is a different thing. Using `__restrict` you tell the compiler that the particular two pointers in this function do not alias each other (actually you say that they are not aliased at all, but what matters most is that they don't alias each other). This is something that's true and obvious to you, but not obvious to the compiler. If the compiler does not know, it must assume the worst case. The worst case is that they _do alias_, which means that many assumptions for optimizations are not safe to assume. This hinders vectorization and requires redundant loads. – Damon Aug 04 '11 at 07:36
For example, if a and b alias each other, you cannot just read a value from one and assume that it is the same value at a later time if the other was written to. Instead, you _must_ fetch the value anew each time. You also cannot just fetch 4 values and do some calculations on some other 4 values and write those back, and assume that the result will be correct. Consuming inputs (as described above) before writing outputs is another solution if you feel uneasy tampering with aliasing. If all inputs are consumed into temporaries, the compiler knows that their values are consistent. – Damon Aug 04 '11 at 07:39
@Damon No, you misunderstood me (I understand what __restrict does, sadly with MS VC not that useful for me :( ), but the code you linked in your comment above mine will violate strict aliasing rules and therefore only work with -fno-strict-aliasing. While that approximation may surely speed up the pow computation, I assume that for a larger program the loss of strict aliasing will more than cancel out that gain. Though one could surely fix the code with some struct magic. – Voo Aug 04 '11 at 14:36
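
The "consume inputs before writing outputs" alternative discussed in these comments can be sketched as follows: each group of four inputs is copied into local temporaries before any output is stored, so no store can change a value the group still needs. The function name `square_blocks` and the block size of four (one SSE register of floats) are assumptions for illustration, and whether a particular compiler turns this into SSE code still has to be checked in its vectorizer report.

```c
#include <stddef.h>

/* Hypothetical name. No restrict qualifiers are used here; instead, all
 * four inputs of a block are read into temporaries before the first
 * store to dst happens. */
void square_blocks(const float *src, float *dst, size_t n)
{
    size_t i = 0;

    /* Main loop: handle four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        float t0 = src[i];
        float t1 = src[i + 1];
        float t2 = src[i + 2];
        float t3 = src[i + 3];

        dst[i]     = t0 * t0;
        dst[i + 1] = t1 * t1;
        dst[i + 2] = t2 * t2;
        dst[i + 3] = t3 * t3;
    }

    /* Tail loop: handle the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        dst[i] = src[i] * src[i];
    }
}
```

If the two arrays do overlap, this version can produce different results from the element-by-element loop, because the reads of a block are no longer interleaved with its writes. That difference is exactly the freedom being handed to the compiler.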

Stephen Canon · Answer 2 · 2011-08-02T21:02:46.210

This is not really an answer to your question, but rather a suggestion for how you might be able to avoid this issue entirely.

You mention that you're on OS X; there are already APIs on that platform that provide the operations you're looking at, without any need for auto-vectorization. Is there some reason that you aren't using them instead? Auto-vectorization is really cool, but it requires some work, and in general it doesn't produce results that are as good as using APIs that are already vectorized for you.

#include <string.h>
#include <Accelerate/Accelerate.h>

int main() {

    int n = 256;
    float a[256],
    b[256];

    // You can initialize the elements of a vector to a set value using memset_pattern:
    float threehalves = 1.5f;
    memset_pattern4(a, &threehalves, 4*n);

    // Since you have a fixed exponent for all of the base values, we will use
    // the vImage gamma functions.  If you wanted to have different exponents
    // for each input (i.e. from an array of exponents), you would use the vForce
    // vvpowf( ) function instead (also part of Accelerate).
    //
    // If you don't need full accuracy, replace kvImageGamma_UseGammaValue with
    // kvImageGamma_UseGammaValue_half_precision to get better performance.
    GammaFunction func = vImageCreateGammaFunction(2.3f, kvImageGamma_UseGammaValue, 0);
    vImage_Buffer src = { .data = a, .height = 1, .width = n, .rowBytes = 4*n };
    vImage_Buffer dst = { .data = b, .height = 1, .width = n, .rowBytes = 4*n };
    vImageGamma_PlanarF(&src, &dst, func, 0);
    vImageDestroyGammaFunction(func);

    // To simply square a instead, use the vDSP_vsq function.
    vDSP_vsq(a, 1, b, 1, n);

    return 0;
}

More generally, unless your algorithm is quite simple, auto-vectorization is unlikely to deliver great results. In my experience, the spectrum of vectorization techniques usually looks about like this:

better performance                                            worse performance
more effort                                                         less effort
+------+------+----------------------+----------------------------+-----------+
|      |      |                      |                            |           |
|      |  use vectorized APIs        |                   auto vectorization   |
|  skilled vector C                  |                              scalar code
hand written assembly       unskilled vector C

I'm developing on OS X, but not necessarily developing for OS X (the intended target is CentOS 5). Also, I'm trying not to pull in many other libraries. Thanks for the OS X API though; I may just use that for other things I'm working on. – Marm0t, Aug 02 '11 at 22:01
@Misha: I'm assuming that people who are going to bother know what they're doing, but yes =) — Stephen Canon, Aug 03 '11 at 05:14
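
For the library-free route, and since the question mentions mostly integer powers, one common option is exponentiation by squaring, which needs only multiplications and so puts no libm call inside the loop. This is a minimal sketch under that assumption. `powi_f` and `pow_int_array` are hypothetical names, and the result can differ from `pow` in the last bits because the roundings accumulate differently.

```c
#include <stddef.h>

/* Hypothetical helper: x raised to a non-negative integer power e,
 * computed by binary exponentiation (square-and-multiply). */
static inline float powi_f(float x, unsigned e)
{
    float result = 1.0f;
    while (e != 0) {
        if (e & 1u) {
            result *= x;   /* this bit of e contributes the current power */
        }
        x *= x;            /* x, x^2, x^4, x^8, ... */
        e >>= 1;
    }
    return result;
}

/* Hypothetical wrapper: applies the same integer exponent to every
 * element. restrict states that src and dst do not overlap. */
void pow_int_array(const float *restrict src, float *restrict dst,
                   size_t n, unsigned e)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = powi_f(src[i], e);
    }
}
```

Because the exponent is the same for every element, the sequence of multiplications is identical across the array, which is the shape a vectorizer wants. Whether a given GCC version actually vectorizes it should be confirmed with the vectorizer report rather than assumed.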
