I have been asked to vectorize a larger program. Before starting on it, I wanted to see the effect of vectorization in an isolated case. For this I wrote two programs that should demonstrate the intended transformation: one uses an array of structs (AoS, no vectorization) and the other a struct of arrays (SoA, with vectorization). I expected the SoA version to outperform the AoS version by far, but it doesn't.
Measured loop of program A (array of structs):
for (int i = 0; i < NUM; i++) {
    ptr[i].c = ptr[i].a + ptr[i].b;
}
full program:
#include <cstdlib>
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
struct myStruct {
    double a, b, c;
};
#define NUM 100000000
high_resolution_clock::time_point t1, t2, t3;
int main(int argc, char* argsv[]) {
    struct myStruct *ptr = (struct myStruct *) malloc(NUM * sizeof(struct myStruct));
    // initialize the inputs (not timed)
    for (int i = 0; i < NUM; i++) {
        ptr[i].a = i;
        ptr[i].b = 2 * i;
    }
    // timed loop
    t1 = high_resolution_clock::now();
    for (int i = 0; i < NUM; i++) {
        ptr[i].c = ptr[i].a + ptr[i].b;
    }
    t2 = high_resolution_clock::now();
    long dur = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "took " << dur << endl;
    // use the result so the timed loop cannot be optimized away
    double sum = 0;
    for (int i = 0; i < NUM; i++) {
        sum += ptr[i].c;
    }
    cout << "sum is " << sum << endl;
    free(ptr);
}
Measured loop of program B (struct of arrays):
#pragma simd
for (int i = 0; i < NUM; i++) {
    C[i] = A[i] + B[i];
}
full program:
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <chrono>
using namespace std;
using namespace std::chrono;
#define NUM 100000000
high_resolution_clock::time_point t1, t2, t3;
int main(int argc, char* argsv[]) {
    double *A = (double *) malloc(NUM * sizeof(double));
    double *B = (double *) malloc(NUM * sizeof(double));
    double *C = (double *) malloc(NUM * sizeof(double));
    // initialize the inputs (not timed); note that C is not touched here
    for (int i = 0; i < NUM; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }
    // timed loop
    t1 = high_resolution_clock::now();
    #pragma simd
    for (int i = 0; i < NUM; i++) {
        C[i] = A[i] + B[i];
    }
    t2 = high_resolution_clock::now();
    long dur = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "SoA " << dur << endl;
    // use the result so the timed loop cannot be optimized away
    double sum = 0;
    for (int i = 0; i < NUM; i++) {
        sum += C[i];
    }
    cout << "sum " << sum << endl;
    free(A);
    free(B);
    free(C);
}
I compile with
icpc vectorization_aos.cpp -qopenmp --std=c++11 -cxxlib=/lrz/mnt/sys.x86_64/compilers/gcc/4.9.3/
Compiled with icpc (v16) and executed on an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz.
In my tests, program A takes around 300 ms and program B around 350 ms. If I add unnecessary extra data to the struct in A, it becomes increasingly slower (as more memory has to be loaded). The -O3 flag has no impact on the run time, and removing the #pragma simd directive has no impact either. So either the loop is auto-vectorized anyway, or my vectorization does not work at all.
Questions:
Am I missing something? Is this how one would vectorize a program?
Why is program B slower? Maybe both programs are simply memory-bandwidth bound and I need to increase the computational density?
Are there programs or code snippets that show the impact of vectorization better, and how can I verify that my program is actually executed vectorized?