3

Benchmarking 2D convolution on Xeon Phi, an OpenCL implementation performs much better than an OpenMP implementation, even with compiler-enabled vectorization. The OpenMP version was run in Phi native mode, and timing measured only the computation part: the for-loop. For the OpenCL implementation, timing likewise covered only the kernel computation: no data transfer included. The OpenMP version was tested with 2, 4, 60, 120, and 240 threads; 240 threads gave the best performance with a balanced thread-affinity setting. But OpenCL was around 17x faster even than the 240-thread OpenMP baseline with pragma-enabled vectorization in the source code. Input image sizes range from 1024x1024 up to 16384x16384, with filter sizes from 3x3 up to 17x17. In all runs, OpenCL was faster than OpenMP. Is this an expected speedup for OpenCL? It seems too good to be true.

EDIT:

Compilation (openmp)

icc Convolve.cpp -fopenmp -mmic -O3 -vec-report1 -o conv.mic
Convolve.cpp(71): (col. 17) remark: LOOP WAS VECTORIZED

Source (Convolve.cpp):

void Convolution_Threaded(float * pInput, float * pFilter, float * pOutput,
          const int nInWidth, const int nWidth, const int nHeight,
          const int nFilterWidth, const int nNumThreads)
{
    #pragma omp parallel for num_threads(nNumThreads)
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        const int yInTopLeft = yOut;

        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            const int xInTopLeft = xOut;

            float sum = 0;
            for (int r = 0; r < nFilterWidth; r++)
            {
                const int idxFtmp = r * nFilterWidth;

                const int yIn = yInTopLeft + r;
                const int idxIntmp = yIn * nInWidth + xInTopLeft;

                #pragma ivdep           // ignore data dependencies assumed by the compiler
                #pragma vector aligned  // all data accessed in the loop is properly aligned
                for (int c = 0; c < nFilterWidth; c++)
                {
                    const int idxF  = idxFtmp  + c;
                    const int idxIn = idxIntmp + c;    
                    sum += pFilter[idxF]*pInput[idxIn];
                }
            } 

            const int idxOut = yOut * nWidth + xOut;
            pOutput[idxOut] = sum;
        } 
    } 
}
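
The benchmark times only the for-loop, as described above; a minimal timing sketch under that assumption (the harness name is hypothetical, and buffers are assumed to be allocated and initialized elsewhere):

#include <omp.h>

// Hypothetical harness: times only the convolution call, as in the benchmark.
double TimeConvolutionMs(float * pInput, float * pFilter, float * pOutput,
                         int nInWidth, int nWidth, int nHeight,
                         int nFilterWidth, int nNumThreads)
{
    const double t0 = omp_get_wtime();
    Convolution_Threaded(pInput, pFilter, pOutput,
                         nInWidth, nWidth, nHeight,
                         nFilterWidth, nNumThreads);
    const double t1 = omp_get_wtime();
    return (t1 - t0) * 1e3;   // omp_get_wtime returns seconds; report ms
}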

Source 2 (convolve.cl):

__kernel void Convolve(const __global float * pInput,
                       __constant float * pFilter,
                       __global float * pOutput,
                       const int nInWidth,
                       const int nFilterWidth)
{
    const int nWidth = get_global_size(0);

    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);

    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;

    float sum = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

        for (int c = 0; c < nFilterWidth; c++)
        {
            const int idxF  = idxFtmp  + c;
            const int idxIn = idxIntmp + c;
            sum += pFilter[idxF]*pInput[idxIn];
        }
    }
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}

Result of OpenMP (in comparison with OpenCL):

        image      filter  exec time (ms)
OpenMP  2048x2048  3x3     23.4
OpenCL  2048x2048  3x3      1.04*

*Raw kernel execution time. Data transfer time over PCI bus not included.
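
Kernel-only timing of this kind is typically read from OpenCL profiling events; a minimal sketch, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and that queue, kernel, and globalSize are already set up:

cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       globalSize, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong tStart, tEnd;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(tStart), &tStart, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(tEnd), &tEnd, NULL);

// Timestamps are in nanoseconds; convert the difference to milliseconds.
double kernelMs = (tEnd - tStart) * 1e-6;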

nikk
  • You're using two different sets of code which have likely been optimized differently. Did you write the OpenMP one yourself? Post the code. It's likely not optimized. Does the OpenCL use vector types (e.g. float4)? These will use SSE/AVX. OpenMP only takes care of threads; if you want to use SSE/AVX you have to do it yourself. – Z boson Apr 24 '14 at 09:29
  • Also, the Xeon Phi has its own SIMD (AVX512), which is 512 bits wide and which OpenCL can take advantage of. OpenMP won't do this for you. – Z boson Apr 24 '14 at 11:26
  • Intel quite likely performs horizontal vectorization of certain kernels, which means that a single Xeon Phi core can actually run 16 threads at the same time (assuming one thread operates on 32-bit values), one per vector element. This naturally depends on the kernel, and not all are easily amenable to this sort of processing. – sharpneli Apr 24 '14 at 12:27
  • The OpenCL and OpenMP benchmarks are the same code from the same source: AMD. The OpenCL code is not manually vectorized -- plain scalar code. The framework does the vectorization. For the OpenMP compilation, the compiler reports that all the loops are vectorized (-vec-report2). Is it still possible that the code will not use SSE/AVX even when the binary contains vector instructions as reported by the compiler? – nikk Apr 24 '14 at 14:04
  • Yes, OpenCL does easily take advantage of SIMD units. How do you make sure OpenMP does this (I enabled it with the compiler hints specified in my code using pragmas)? I am adding the code to the EDIT post. – nikk Apr 24 '14 at 14:06
  • I have added the source code to the EDIT section. – nikk Apr 24 '14 at 14:45
  • Can you show us which loop ICC is indicating is vectorized (i.e. which loop is on line 71)? If this isn't the same loop that you have exposed in your OpenCL NDRange, then this could be where some of the performance discrepancy is coming from. – jprice Apr 24 '14 at 15:58
  • Hi, the vectorized loop that the compiler is referring to is the inner-most loop (line 71). It is also the same loop exposed to OpenCL vectorization. I have added the OpenCL source in the edit section. – nikk Apr 24 '14 at 16:12
  • Intel's OpenCL implementation vectorizes by mapping work-items onto SIMD lanes. This means that your OpenCL version of the code is vectorizing across pixels, whereas your OpenMP version is vectorizing the convolution within a single pixel. – jprice Apr 24 '14 at 16:55
  • okay, how can I fix this issue? – nikk Apr 24 '14 at 17:18
  • @Zboson Intel Xeon Phi [57]1** (aka Knights Corner) do _not_ support AVX512. That's the next one (Knights Landing). – Jeff Hammond Feb 03 '15 at 06:18
  • @Jeff, Knights Corner supports 512-bit-wide SIMD. You're correct that it's not identical to AVX512, but it's quite similar as far as I understand. If you look at the Intel Intrinsics Guide you can see that most of the intrinsics between Knights Corner and AVX512 have a one-to-one match. – Z boson Feb 03 '15 at 08:58
  • @Zboson similarity != equivalence is my point. That's all. – Jeff Hammond Feb 03 '15 at 19:51

3 Answers

2

Previously (with #pragma ivdep and #pragma vector aligned on the inner-most loop):

Compiler output: 
Convolve.cpp(24): (col. 17) remark: LOOP WAS VECTORIZED

Program output:
120 Cores: 0.0087 ms

After advice from @jprice (with #pragma simd on the horizontal pixel loop):

Compiler output:
Convolve.cpp(24): (col. 9) remark: **SIMD** LOOP WAS VECTORIZED

Program output:
120 Cores: 0.00305 ms

OpenMP is now 2.8x faster than its previous execution. A fair comparison can now be made with OpenCL! Thanks to jprice and to everyone who contributed. Learnt huge lessons from you all.

EDIT: Here are my results and comparison:

        image      filter  exec time (ms)
OpenMP  2048x2048  3x3     4.3
OpenCL  2048x2048  3x3     1.04

Speedup: 4.1X

Can OpenCL really still be this much faster than OpenMP?

nikk
  • You might also try to parallelize over both the x and y outer loops. It won't matter for very large images, but for 1024x1024 on 240 threads it probably will. You have to make one trivial change for the compiler to accept this (the two loops must be perfectly nested, so the yInTopLeft declaration moves inside the inner loop): #pragma omp parallel for collapse(2) num_threads(nNumThreads) for (int yOut = 0; yOut < nHeight; yOut++) { for (int xOut = 0; xOut < nWidth; xOut++) { const int yInTopLeft = yOut; const int xInTopLeft = xOut; – Jeff Hammond Feb 03 '15 at 19:39
1

Intel's OpenCL implementation will use what they call "implicit vectorisation" in order to take advantage of vector floating point units. This involves mapping work-items onto SIMD lanes. In your example, each work-item is processing a single pixel, which means that each hardware thread will be processing 16 pixels at a time using the Xeon Phi's 512-bit vector units.

By contrast, your OpenMP code is parallelising across pixels, and then vectorising the computation within a pixel. This is almost certainly where the performance difference is coming from.

In order to get ICC to vectorize your OpenMP code in a manner that is similar to the implicitly vectorised OpenCL code, you should remove your #pragma ivdep and #pragma vector aligned statements from the innermost loop, and instead just place a #pragma simd in front of the horizontal pixel loop:

#pragma omp parallel for num_threads(nNumThreads)
for (int yOut = 0; yOut < nHeight; yOut++)
{
    const int yInTopLeft = yOut;

    #pragma simd
    for (int xOut = 0; xOut < nWidth; xOut++)
    {

When I compile this with ICC, it reports that it is successfully vectorising the desired loop.
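
For completeness, here is that change applied to the full loop nest from the question; the #pragma simd replaces the inner-loop pragmas, and the rest of the code is unchanged:

#pragma omp parallel for num_threads(nNumThreads)
for (int yOut = 0; yOut < nHeight; yOut++)
{
    const int yInTopLeft = yOut;

    #pragma simd            // vectorize across pixels: one SIMD lane per xOut
    for (int xOut = 0; xOut < nWidth; xOut++)
    {
        const int xInTopLeft = xOut;

        float sum = 0;
        for (int r = 0; r < nFilterWidth; r++)
        {
            const int idxFtmp = r * nFilterWidth;
            const int yIn = yInTopLeft + r;
            const int idxIntmp = yIn * nInWidth + xInTopLeft;

            for (int c = 0; c < nFilterWidth; c++)
                sum += pFilter[idxFtmp + c] * pInput[idxIntmp + c];
        }

        pOutput[yOut * nWidth + xOut] = sum;
    }
}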

jprice
1

Your OpenMP program uses one thread per row of the image, and the pixels within a row are vectorized. That is the equivalent of a one-dimensional workgroup in OpenCL, where each workgroup processes one row of the image. But in your OpenCL code it seems that you have a two-dimensional workgroup: each workgroup (mapped onto one thread on the Phi) processes a BLOCK of the image, not a ROW. The cache hit rate will be different.
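
How the work is mapped to workgroups is controlled by the local work size passed at enqueue time. A sketch (variable names hypothetical; the local size must evenly divide the global size):

size_t globalSize[2] = { (size_t)nWidth, (size_t)nHeight };

// One-row strips: all work-items in a group share the same y,
// which mimics the row-wise OpenMP decomposition.
size_t localRow[2]   = { 64, 1 };

// 2D blocks: each group covers an 8x8 tile of output pixels,
// the block decomposition described above.
size_t localBlock[2] = { 8, 8 };

clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       globalSize, localRow /* or localBlock */,
                       0, NULL, NULL);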