
I am doing some performance tests on a Xeon Phi using Cilk Plus with offload.

In a simple vector-add program I have two ways to do it:

  1. Using cilk_for to split the work across different threads on the Xeon Phi:

    __declspec(target(mic)) void vector_add(double *A, double *B, double *C,
                                            int vector_size)
    {
        _Cilk_for (int i = 0; i < vector_size; i++)
        {
            C[i] += A[i] + B[i];
        }
    }
    double *A, *B, *C;
    // allocating and initializing A, B, C using malloc...
    #pragma offload target(mic:0) \
        in(B:length(vector_size)) \
        in(A:length(vector_size)) \
        in(C:length(vector_size)) \
        in(vector_size)
    {
        vector_add(A, B, C, vector_size);
    }
    
  2. Using Cilk Plus array notation:

    double *A, *B, *C;
    // allocating and initializing A, B, C using malloc...
    #pragma offload target(mic:0) \
        in(B:length(vector_size)) \
        in(A:length(vector_size)) \
        in(C:length(vector_size))
    //in(vector_size)
    //signal(offload0)
    {
        C[0:vector_size] = A[0:vector_size] + B[0:vector_size];
    }
    

My tests show the first way is ~10x faster than the second on the Xeon Phi. The same thing happens when I do not offload and run it on a Xeon E5 host CPU.

First, I want to know whether my understanding is correct:

The first way only exploits thread parallelism (60 cores * 4 threads each) on the Xeon Phi, but no vector operations are performed.

The second way only exploits vectorization: it runs the code in a single thread, using SIMD (IMCI) instructions.

Second, what is the correct way to write this so that it both splits the work across different threads and uses vector instructions on the Xeon Phi?

Thanks in advance.

yidiyidawu
  • If that's really the entire loop, and the arrays are large, then it's completely memory bound. Using 1 thread on the CPU may be about as fast as it gets. – Mysticial Apr 11 '15 at 02:50

1 Answer


Actually, if you look at the optimization reports the compiler produces (-opt-report), or at the VTune output if you have that, you might be surprised. Your second example does, as you surmised, only vectorize. However, your first example can vectorize in addition to parallelizing. Remember that _Cilk_for does not hand out individual iterations but chunks of iterations, which can in some cases be vectorized.
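
A rough way to picture this (an illustrative sketch of the chunking, not what the runtime literally generates; begin and end are hypothetical per-chunk bounds):

    // Illustrative only: _Cilk_for hands each worker a contiguous chunk
    // [begin, end) of the iteration space, so each worker effectively runs
    // a plain serial loop over its chunk:
    for (int i = begin; i < end; i++)
        C[i] += A[i] + B[i];
    // It is this inner serial loop that the compiler can vectorize, giving
    // thread parallelism across chunks and SIMD within each chunk.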

For better control, you can try nesting loops to explicitly separate the parallel and vector loops, playing with the grain size to change how much work a thread takes on at any given time, or using a number of different pragmas; see the sketch below.
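
For example, a minimal sketch of the nested approach (assuming vector_size is a multiple of a hypothetical CHUNK size, and that your compiler accepts the Cilk Plus grainsize pragma) might look like this:

    __declspec(target(mic)) void vector_add(double *A, double *B, double *C,
                                            int vector_size)
    {
        // Hypothetical chunk size; tune for your problem size and caches.
        // Assumes vector_size % CHUNK == 0 to keep the sketch short.
        const int CHUNK = 1024;

        // Outer parallel loop: each iteration handles one contiguous chunk.
        // The grainsize pragma controls how many chunks a worker takes at once.
        #pragma cilk grainsize = 4
        _Cilk_for (int j = 0; j < vector_size; j += CHUNK)
        {
            // Inner vector part: array notation over one chunk, which the
            // compiler can turn into SIMD (IMCI) instructions.
            C[j:CHUNK] = A[j:CHUNK] + B[j:CHUNK];
        }
    }

If vector_size is not a multiple of CHUNK you would add a remainder loop; the point is only to show the parallel level (across chunks) separated from the vector level (within a chunk).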

For advice on optimizing specifically for the Intel Xeon Phi coprocessor, I like to point people to https://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture, but I think you might find some of that too basic. Still, if you feel like digging around....

froth
  • Thank you for your answer. I am new to VTune; how do I see vectorization there? I ran a hotspot analysis on Knights Corner (MIC) and I only see stacks and thread times. – yidiyidawu Apr 14 '15 at 22:56