Unexplained Xeon-Phi Overhead

Question

I am trying to run this code with these different n sizes on a Xeon Phi KNC. I am getting the timings shown in the table, but I have no idea why I am experiencing those fluctuations. Can you please guide me through it? Thanks in advance.

CODE:

program prog
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n, time_start, time_end
  n=481
  do while (n .le. 481000000)
    allocate(arr1(n),arr2(n))
    call system_clock(time_start)
    !dir$ offload begin target(mic)
    !$omp SIMD 
    do i=1,n
       arr1(i) = arr1(i) + arr2(i)
    end do
    !dir$ end offload 
    call system_clock(time_end)
    write (*,*) "n=",n," time=",time_end-time_start
    deallocate(arr1,arr2)
    n = n*10
  end do
end program

RESULTS:

 n=         481  time=        8881
 n=        4810  time=          75
 n=       48100  time=          53
 n=      481000  time=         261
 n=     4810000  time=        1991
 n=    48100000  time=       18912
 n=   481000000  time=      188203

Settings: `#!/bin/bash #SBATCH -N 1 #SBATCH -o out_122 #SBATCH --exclusive export MIC_KMP_AFFINITY=verbose,granularity=fine,scatter export MIC_OMP_NUM_THREADS=122 ./prog.exe` — Gal Oren, Jul 06 '18 at 10:11
`sbatch -p xphi -N 1 --exclusive run_par.sh`, while all of the settings are in `run_par.sh` and `xphi` is the name of the device. — Gal Oren, Jul 07 '18 at 08:16
It's also worth mentioning that a **native** run (addition of `!dir$ offload begin target(mic)` before the `!$omp SIMD`) yields much better results. `n= 481 time= 0 n= 4810 time= 0 n= 48100 time= 6 n= 481000 time= 55 n= 4810000 time= 455 n= 48100000 time= 4342 n= 481000000 time= 43322` — Gal Oren, Jul 07 '18 at 08:25
In the **native** run the settings are: `#!/bin/bash #SBATCH -N 1 #SBATCH -o out_244_native #SBATCH --exclusive export SINK_LD_LIBRARY_PATH=...intel/compilers_and_libraries/linux/lib/mic:$SINK_LD_LIBRARY_PATH micnativeloadex ./prog.exe.MIC -e "KMP_AFFINITY=verbose,granularity=fine,scatter"` — Gal Oren, Jul 07 '18 at 08:27

score 1 · Answer 1 · answered Jul 09 '18 at 08:29

The first offload (n=481) will certainly be slow because that is where you are offloading all of the code and initialising the process on the KNC. If you don't want to see that, do an empty offload before you start timing things.
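A minimal sketch of that warm-up idea, reusing the question's directives. The `dummy` variable and the program name are hypothetical, and the warm-up region holds one trivial assignment rather than being literally empty, as a cautious assumption about what the compiler accepts. The arrays are also initialised here, which the original code does not do.

```fortran
program warmup_sketch
  implicit none
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n, dummy, time_start, time_end

  ! Untimed warm-up offload: intended to absorb the one-time cost of
  ! loading the code and initialising the process on the coprocessor.
  ! "dummy" is a hypothetical placeholder so the region is not empty.
  !dir$ offload begin target(mic)
  dummy = 0
  !dir$ end offload

  n = 481
  allocate(arr1(n), arr2(n))
  arr1 = 0
  arr2 = 1

  ! Timed region: same loop as in the question.
  call system_clock(time_start)
  !dir$ offload begin target(mic)
  !$omp simd
  do i = 1, n
     arr1(i) = arr1(i) + arr2(i)
  end do
  !dir$ end offload
  call system_clock(time_end)

  write (*,*) "n=", n, " time=", time_end - time_start
  deallocate(arr1, arr2)
end program warmup_sketch
```

If the large n=481 figure is mostly initialisation, it should shrink once the warm-up is in place. Any remaining cost would then come from data transfer and the loop itself.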

At the high end (>=481000), things seem sane; each run is ~10x slower than the one before, so the only oddities now are the scaling of the lower ones. It's possible that some of that is related to load imbalance. If you have a 60 core processor and are running 4T/C (you didn't give us this information), 4810 iterations => ~20 iterations/thread, which means the SIMD performance is likely to be poor, as you have 16 lanes. (Given misalignment you may only be executing a lead-in and lead-out, and nothing at full width!)
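The arithmetic behind that estimate can be worked through in a few lines. This is only a sketch under the answer's assumed configuration (60 cores, 4 threads per core, 16 integer lanes) plus an even static split of the loop across all threads. None of this is confirmed by the question, and the module and function names are hypothetical.

```fortran
module chunk_math
  implicit none
contains
  ! Iterations per thread under an even split (integer division).
  pure integer function iters_per_thread(n, cores, threads_per_core) result(iters)
    integer, intent(in) :: n, cores, threads_per_core
    iters = n / (cores * threads_per_core)
  end function iters_per_thread

  ! Complete SIMD vectors in one chunk, assuming perfect alignment.
  pure integer function full_vectors(iters, lanes) result(nvec)
    integer, intent(in) :: iters, lanes
    nvec = iters / lanes
  end function full_vectors
end module chunk_math

program chunk_demo
  use chunk_math
  implicit none
  ! Assumed hardware figures, taken from the answer, not from the question.
  integer, parameter :: cores = 60, tpc = 4, lanes = 16
  integer :: n, iters

  n = 481
  do while (n <= 481000)
     iters = iters_per_thread(n, cores, tpc)
     write (*,*) "n=", n, " iters/thread=", iters, &
                 " full vectors=", full_vectors(iters, lanes), &
                 " leftover=", mod(iters, lanes)
     n = n * 10
  end do
end program chunk_demo
```

For n=4810 this gives 4810 / 240 = 20 iterations per thread, which is one full 16-lane vector plus 4 leftover iterations. With a misaligned start, even that single full vector can disappear into the lead-in and lead-out.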
