
I have been asked to vectorize a larger program. Before starting on the big program I wanted to see the effect of vectorization in an isolated case, so I wrote two small programs that should demonstrate the idea of the transformation in question: one using an array of structs (which I do not expect to vectorize) and one using a struct of arrays (which I do). I expected the SoA version to outperform the AoS version by far, but it doesn't.
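
To make the two layouts concrete, this is the difference in a nutshell (same field and array names as in the full programs below):

// array of structs (AoS): the fields of one element are interleaved in memory
struct myStruct { double a, b, c; };
myStruct *ptr;              // element i lives at ptr[i].a, ptr[i].b, ptr[i].c

// struct of arrays (SoA): one contiguous array per field
double *A, *B, *C;          // element i lives at A[i], B[i], C[i]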


Measured loop of program A (array of structs):

for (int i = 0; i < NUM; i++) {
    ptr[i].c = ptr[i].a + ptr[i].b;
}

full program:

#include <cstdlib>
#include <iostream>
#include <stdlib.h>

#include <chrono>
using namespace std;
using namespace std::chrono;


struct myStruct {
    double a, b, c;
};
#define NUM 100000000

high_resolution_clock::time_point t1, t2, t3;

int main(int argc, char* argsv[]) {
    struct myStruct *ptr = (struct myStruct *) malloc(NUM * sizeof(struct myStruct));

    for (int i = 0; i < NUM; i++) {
        ptr[i].a = i;
        ptr[i].b = 2 * i;
    }
    t1 = high_resolution_clock::now();
    // measured loop: one complete struct (a, b, c) is touched per iteration
    for (int i = 0; i < NUM; i++) {
        ptr[i].c = ptr[i].a + ptr[i].b;
    }
    t2 = high_resolution_clock::now();
    long dur = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "took "<<dur << endl;
    double sum = 0;
    for (int i = 0; i < NUM; i++) {
        sum += ptr[i].c;
    }
    cout << "sum is "<< sum << endl;

}

Measured loop of program B (struct of arrays):

#pragma simd 
for (int i = 0; i < NUM; i++) {
    C[i] = A[i] + B[i];
}

full program:

#include <cstdlib>
#include <iostream>
#include <stdlib.h>
#include <omp.h>
#include <chrono>

using namespace std;
using namespace std::chrono;

#define NUM 100000000

high_resolution_clock::time_point t1, t2, t3;

int main(int argc, char* argsv[]) {
    double *A = (double *) malloc(NUM * sizeof(double));
    double *B = (double *) malloc(NUM * sizeof(double));
    double *C = (double *) malloc(NUM * sizeof(double));
    for (int i = 0; i < NUM; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }


    t1 = high_resolution_clock::now();
    // measured loop: three separate contiguous arrays (SoA)
    #pragma simd
    for (int i = 0; i < NUM; i++) {
        C[i] = A[i] + B[i];
    }
    t2 = high_resolution_clock::now();
    long dur = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "Aos "<<dur << endl;

    double sum = 0;
    for (int i = 0; i < NUM; i++) {
        sum += C[i];
    }
    cout << "sum "<<sum;
}

I compile with

icpc vectorization_aos.cpp -qopenmp --std=c++11 -cxxlib=/lrz/mnt/sys.x86_64/compilers/gcc/4.9.3/

icpc (v16) compiled and executed on an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
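
To check whether the loops are vectorized at all, the compiler can be asked for a vectorization report, and a scalar baseline can be produced by disabling vectorization. The flag spellings below are what I understand them to be for icpc 16 (the -cxxlib option from above is omitted here), so treat them as an assumption:

icpc vectorization_aos.cpp -qopenmp --std=c++11 -qopt-report=2 -qopt-report-phase=vec

icpc vectorization_aos.cpp -qopenmp --std=c++11 -no-vec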

In my test runs program A takes around 300 ms and program B around 350 ms. If I add unnecessary extra members to the struct in A, it becomes increasingly slower (as more memory has to be loaded). The -O3 flag has no impact on the run time, and removing the #pragma simd directive also has no impact. So either the loops are auto-vectorized anyway, or my vectorization does not work at all.

Questions:

  • Am I missing something? Is this how one would go about vectorizing a program?

  • Why is program B slower? Perhaps both programs are simply memory-bandwidth bound and I need to increase the computational density (see the sketch after this list)?

  • Are there programs or code snippets that show the impact of vectorization more clearly, and how can I verify that my program is actually being executed in vectorized form?
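
To illustrate what I mean by computational density in the second question, here is a minimal sketch that would replace the measured loop of program B: it performs several floating-point operations per loaded element instead of a single addition, so the loop should no longer be dominated by memory traffic (the repeat count and the formula are arbitrary, chosen only to keep the values finite and the work per element high):

#pragma simd
for (int i = 0; i < NUM; i++) {
    double x = A[i];
    double y = B[i];
    double acc = 0.0;
    // 48 floating-point operations per element instead of one addition
    for (int k = 0; k < 16; k++) {
        acc = 0.5 * acc + x + y;
    }
    C[i] = acc;
}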

  • Did you [enable optimizations](https://software.intel.com/en-us/articles/step-by-step-optimizing-with-intel-c-compiler)? What exact optimization flags? – Basile Starynkevitch Dec 03 '16 at 13:57
  • mark your pointers `A`, `B` and `C` with `restrict`. – Yakov Galka Dec 03 '16 at 14:16
  • [Sometimes the SSE instructions are slower](http://stackoverflow.com/a/35923973/1460794). – wally Dec 03 '16 at 14:17
  • @BasileStarynkevitch I tested without optimization as well as with -O3. Neither had any meaningful impact on the run time; I assume the program is too simple for additional optimizations to matter. – Andreas Schmelz Dec 03 '16 at 15:07
  • I read the guide at https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf. With the flag -vec-report you get a report showing that both programs A and B are auto-vectorized, and you can disable vectorization with -no-vec. Program A seems to be faster because it reads memory as one contiguous stream, while program B touches three contiguous blocks and is thus a little slower. – Andreas Schmelz Dec 03 '16 at 15:18
  • @ybungalobill I tried it (see the sketch below) without any impact on performance. – Andreas Schmelz Dec 03 '16 at 15:18
  • I found this if you're interested : http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0193r0.html – Mathieu Van Nevel Dec 03 '16 at 16:44
  • If you have that vec-report then you may find that the compiler is better than the average attempt. Some prefetch pragmas or safelen in OMP can also help. Basically if the compiler is vectorising it, then why look at doing it yourself? If restructuring the code gets the compiler to vectorize it then that is easiest. – Holmz Dec 03 '16 at 21:40
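
To make the `restrict` suggestion from the comments concrete, here is a sketch of how it would be applied to program B, using the __restrict__ extension that icpc and gcc both accept (the plain C99 restrict keyword reportedly needs the -restrict option with icpc); only the declarations and the measured loop change:

double * __restrict__ A = (double *) malloc(NUM * sizeof(double));
double * __restrict__ B = (double *) malloc(NUM * sizeof(double));
double * __restrict__ C = (double *) malloc(NUM * sizeof(double));

// ... initialization as before ...

#pragma simd
for (int i = 0; i < NUM; i++) {
    // the compiler may now assume A, B and C never alias
    C[i] = A[i] + B[i];
}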
