I'm using Intel Cilk Plus array notation to practice vector programming. However, I met strange copy behaviors of array assignments.
The problem to solve is parallel prefix. D is input, and P is output.
//Wrong result code. Tested on Intel icc 14.0.2.
void vec_prefix(int n, int D[n], int P[n]) {
P[0:n] = D[0:n]; //initial copy
int bound=1;
int * Buf1 = _mm_malloc(n*sizeof(int), 16);
int * Buf2 = _mm_malloc(n*sizeof(int), 16);
while(bound<n){
Buf1[0:n-bound] = P[0:n-bound];
Buf2[0:n-bound] = P[bound:n-bound];
//printf("WHY??\n"); //Add this fence, the result will be correct.
P[bound:n-bound] = Buf1[0:n-bound] + Buf2[0:n-bound];
bound<<=1;
}
_mm_free(Buf1); _mm_free(Buf2);
}
If I remove the comment of the line "printf", the result is correct. Otherwise, wrong. But if all copies follow the descriptions in the Intel's document, the code should be correct.
It seems if there is no such as memory fence, the first two lines' Buf1/Buf2 copy are not finished, and the add operation later uses some unstable value. Or, the compiler optimizer just use copy-propagation to remove the copy, and create "P[bound:n-bound] = P[0:n-bound] + P[bound:n-bound]". This is undefined in Intel's document.
//Correct result code
void vec_prefix(int n, int D[n], int P[n]) {
P[0:n] = D[0:n]; //initial copy
int bound=1;
int * Buf1 = _mm_malloc(n*sizeof(int), 16);
while(bound<n){
//direct copy part
Buf1[0:bound] = P[0:bound];
//add part
Buf1[bound:n-bound] = P[bound:n-bound] + P[0:n-bound];
//copy back
P[0:n] = Buf1[0:n];
bound<<=1;
}
_mm_free(Buf1);
}
Ref: Intel's document