I am doing some performance tests on Xeon Phi using Cilk Plus with offload.
In a simple vector-add program I have two ways to do it:
Using cilk_for to split the work across different threads on the Xeon Phi:
    __declspec(target(mic)) void vector_add(double *A, double *B, double *C, int vector_size)
    {
        _Cilk_for (int i = 0; i < vector_size; i++) {
            C[i] += A[i] + B[i];
        }
    }

    double *A, *B, *C;
    // allocating and initializing A, B, C using malloc...

    #pragma offload target(mic:0) \
        in(B:length(vector_size)) \
        in(A:length(vector_size)) \
        in(C:length(vector_size)) \
        in(vector_size)
    {
        vector_add(A, B, C, vector_size);
    }
Using Cilk Plus array notation:
    double *A, *B, *C;
    // allocating and initializing A, B, C using malloc...

    #pragma offload target(mic:0) \
        in(B:length(vector_size)) \
        in(A:length(vector_size)) \
        in(C:length(vector_size))
        //in(vector_size)
        //signal(offload0)
    {
        C[0:vector_size] = A[0:vector_size] + B[0:vector_size];
    }
My tests show that the first way is ~10x faster than the second on the Xeon Phi. The same thing happens when I do not offload and run the code on a Xeon E5 host CPU.
First, I want to know whether my understanding is correct:
The first way only exploits thread parallelism (60 cores × 4 threads each) on the Xeon Phi, but no vector operations will be performed.
The second way only exploits vectorization: it will run the code in a single thread, using SIMD (IMCI) instructions.
Second, I would like to know the correct way to write this so that it both splits the work across threads and uses vector instructions on the Xeon Phi. A rough sketch of what I have in mind is below.
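For example, this is roughly what I imagine a combined version might look like; the chunked _Cilk_for plus array-notation pattern and the chunk size of 4096 are just my guesses, not something I have verified:

    __declspec(target(mic)) void vector_add(double *A, double *B, double *C, int vector_size)
    {
        // Guess: the outer _Cilk_for spreads chunks across threads,
        // and the array-notation statement vectorizes each chunk.
        const int chunk = 4096;  // arbitrary chunk size, for illustration only
        _Cilk_for (int i = 0; i < vector_size; i += chunk) {
            int len = (vector_size - i < chunk) ? (vector_size - i) : chunk;
            C[i:len] += A[i:len] + B[i:len];
        }
    }

Is this the right approach, or is there a more idiomatic way to get both kinds of parallelism?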
Thanks in advance.