
I am trying to run the following code with different sizes of n on a Xeon Phi KNC (61 cores, 4 threads per core) and on a Xeon host (2 sockets of Xeon E5-2660 v2).

I am getting the timings shown in the tables below. However, I am trying to understand why the MIC's performance is poorer than that of the Xeon processor. What am I doing wrong here, and how can I fix it (if possible)?

Thanks!

CODE:

program prog
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n, time_start, time_end
  n=481
  do while (n .le. 481000000)
    allocate(arr1(n),arr2(n))
    call system_clock(time_start)
    !dir$ offload begin target(mic)
    !$omp SIMD 
    do i=1,n
       arr1(i) = arr1(i) + arr2(i)
    end do
    !dir$ end offload 
    call system_clock(time_end)
    write (*,*) "n=",n," time=",time_end-time_start
    deallocate(arr1,arr2)
    n = n*10
  end do
end program
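A note on reading the numbers: the time values reported in this post are raw differences of `system_clock` counts, so their unit depends on the clock's count rate, which the program above does not query. The following is a minimal sketch, not the code that was benchmarked, of how the same kind of measurement could be converted to seconds using the optional `count_rate` argument of the standard `system_clock` intrinsic. The program name `tick_demo`, the fixed size `n`, and the initial values of the arrays are assumptions made only for illustration.

```fortran
program tick_demo
  implicit none
  integer, parameter :: n = 481000        ! illustrative size, not from the benchmark
  integer, allocatable :: a(:), b(:)
  integer :: i, t0, t1, rate

  allocate(a(n), b(n))
  a = 1                                   ! assumed initial values so the result is defined
  b = 2

  ! The second argument returns the number of clock ticks per second.
  call system_clock(t0, rate)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  call system_clock(t1)

  write (*,*) "ticks=", t1 - t0, " seconds=", real(t1 - t0) / real(rate)
  ! Using the result keeps the compiler from discarding the loop; a(n) is 1 + 2 = 3.
  write (*,*) "check a(n)=", a(n)

  deallocate(a, b)
end program tick_demo
```

Printing the count rate alongside the tick difference makes timings from different machines or compilers directly comparable, since the rate is not guaranteed to be the same everywhere.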

Xeon-Phi RESULTS:

 n=         481  time=        8881
 n=        4810  time=          75
 n=       48100  time=          53
 n=      481000  time=         261
 n=     4810000  time=        1991
 n=    48100000  time=       18912
 n=   481000000  time=      188203

Settings:

#!/bin/bash
#SBATCH -N 1
#SBATCH -o out_122
#SBATCH --exclusive
export MIC_KMP_AFFINITY=verbose,granularity=fine,scatter
export MIC_OMP_NUM_THREADS=122
./prog.exe

sbatch -p xphi -N 1 --exclusive run_par.sh

Here, all of the settings above are in run_par.sh, and xphi is the name of the device.

It's also worth mentioning that a native run (addition of !dir$ offload begin target(mic) before the !$omp SIMD) yields much better results:

n= 481       time= 0 
n= 4810      time= 0 
n= 48100     time= 6 
n= 481000    time= 55 
n= 4810000   time= 455 
n= 48100000  time= 4342 
n= 481000000 time= 43322

In the native run the settings are:

#!/bin/bash
#SBATCH -N 1
#SBATCH -o out_244_native
#SBATCH --exclusive
export SINK_LD_LIBRARY_PATH=...intel/compilers_and_libraries/linux/lib/mic:$SINK_LD_LIBRARY_PATH
micnativeloadex ./prog.exe.MIC -e "KMP_AFFINITY=verbose,granularity=fine,scatter"

Xeon RESULTS:

 n=         481         time=           0
 n=        4810         time=           0
 n=       48100         time=           2
 n=      481000         time=          19
 n=     4810000         time=          93
 n=    48100000         time=         706
 n=   481000000         time=        7006

Here is the output of lscpu command on my Xeon machine:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping:              4
CPU MHz:               1203.382
BogoMIPS:              4405.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39

My MIC specs are (tail of /proc/cpuinfo):

processor       : 239
vendor_id       : GenuineIntel
cpu family      : 11
model           : 1
model name      : 0b/01
stepping        : 3
cpu MHz         : 1052.630
cache size      : 512 KB
physical id     : 0
siblings        : 240
core id         : 59
cpu cores       : 60
apicid          : 239
initial apicid  : 239
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr htsyscall nx lm nopl lahf_lm
bogomips        : 2112.44
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
Gal Oren