The performance of Xeon Phi benchmarked with 2D convolution in opnecl seems much better than an openmp implementation even with compiler-enabled vectorization. Openmp version was run in phi native mode, and timing measured only computation part: For-loop. For the opencl implementation, timing was only for kernel computation as well: no data transfer included. OpenMp-enbaled version was tested with 2,4,60,120,240 threads. - 240 threads gave the best performance for a balanced thread affinity setting. But Opencl was around 17x better even for the 240-thread openmp baseline with pragma-enbled vectorization is source code. Input image size is for 1024x1024 up to 16384x16384, and filter size of 3x3 up to 17x17. In call runs, opencl was better than openmp. Is this an expected speedup of opencl?? Seems too good to be true.
EDIT:
Compilation (openmp)
icc Convolve.cpp -fopenmp -mmic -O3 -vec-report1 -o conv.mic
Convolve.cpp(71): (col. 17) remark: LOOP WAS VECTORIZED
Source (Convole.cpp):
void Convolution_Threaded(float * pInput, float * pFilter, float * pOutput,
const int nInWidth, const int nWidth, const int nHeight,
const int nFilterWidth, const int nNumThreads)
{
#pragma omp parallel for num_threads(nNumThreads)
for (int yOut = 0; yOut < nHeight; yOut++)
{
const int yInTopLeft = yOut;
for (int xOut = 0; xOut < nWidth; xOut++)
{
const int xInTopLeft = xOut;
float sum = 0;
for (int r = 0; r < nFilterWidth; r++)
{
const int idxFtmp = r * nFilterWidth;
const int yIn = yInTopLeft + r;
const int idxIntmp = yIn * nInWidth + xInTopLeft;
#pragma ivdep //discards any data dependencies assumed by compiler
#pragma vector aligned //all data accessed in the loop is properly aligned
for (int c = 0; c < nFilterWidth; c++)
{
const int idxF = idxFtmp + c;
const int idxIn = idxIntmp + c;
sum += pFilter[idxF]*pInput[idxIn];
}
}
const int idxOut = yOut * nWidth + xOut;
pOutput[idxOut] = sum;
}
}
}
Source 2 (convolve.cl)
__kernel void Convolve(const __global float * pInput,
__constant float * pFilter,
__global float * pOutput,
const int nInWidth,
const int nFilterWidth)
{
const int nWidth = get_global_size(0);
const int xOut = get_global_id(0);
const int yOut = get_global_id(1);
const int xInTopLeft = xOut;
const int yInTopLeft = yOut;
float sum = 0;
for (int r = 0; r < nFilterWidth; r++)
{
const int idxFtmp = r * nFilterWidth;
const int yIn = yInTopLeft + r;
const int idxIntmp = yIn * nInWidth + xInTopLeft;
for (int c = 0; c < nFilterWidth; c++)
{
const int idxF = idxFtmp + c;
const int idxIn = idxIntmp + c;
sum += pFilter[idxF]*pInput[idxIn];
}
}
const int idxOut = yOut * nWidth + xOut;
pOutput[idxOut] = sum;
}
Result of OpenMP (in comparison with OpenCL):
image filter exec Time (ms)
OpenMP 2048x2048 3x3 23.4
OpenCL 2048x2048 3x3 1.04*
*Raw kernel execution time. Data transfer time over PCI bus not included.