Why is Qpar faster than OpenMP?

Question

I have a series of benchmarks that carry out the same calculations via CUDA, multiple threads and OpenMP, currently being tested under Windows 8.1. The threaded program required MS compiler version 18.00 for x64, from Visual Studio 2013, to reach full SIMD speed, though it did not do so for AVX. See:

How can I improve performance compiling for SSE and AVX?

The OpenMP version produced the slower SISD speed (see the link above) with an earlier compiler, and version 18.00 brought no improvement. Then I discovered Qpar, the MS Auto-Parallelizer, which produced full-speed SIMD with VS 2013. One of the test loops is shown below, along with the pragma directives and compile options used.

#pragma  ??
for(i=0; i < n; i++)
{
   x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
}

OpenMP -  #pragma omp parallel for          Compile option /openmp
Qpar   -  #pragma loop(hint_parallel(4))    Compile option /Qpar

Qpar thread count has to be a constant, so the program repeats the calculation
code with a different count value for run time parameters 1, 2, 4, 8, 16.
OpenMP uses the /Affinity command to select 1 core (or more).
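
For reference, a minimal sketch of how the OpenMP directive attaches to the test loop. The wrapper name `triad_omp`, the float data type and the argument list are assumptions for illustration and do not come from the benchmark source. Compiled with /openmp (or -fopenmp for gcc) the loop is split across threads. Without that option the pragma is ignored and the loop runs serially.

```c
/* Hypothetical wrapper around the test loop (names are illustrative only). */
void triad_omp(float *x, int n, float a, float b, float c,
               float d, float e, float f)
{
    int i;  /* signed loop index, as required by OpenMP 2.0 */

    /* For the Qpar build, this line would instead be
       #pragma loop(hint_parallel(4))
       and the file would be compiled with /Qpar. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
    {
        x[i] = (x[i]+a)*b - (x[i]+c)*d + (x[i]+e)*f;
    }
}
```

Each iteration reads and writes only its own x[i], so the loop has no cross-iteration dependency and can be split across threads without extra clauses.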

Results on a quad core i7 4820K with 10 MB L3 cache, running at 3.9 GHz, are below. The 400 KB and 4 MB tests run mainly from L3 cache, but the largest data size is influenced by RAM speed. Maximum Qpar speed is quite good at 6.12 MFLOPS per MHz per core out of a maximum of 8.

Are there other compile options or directives that will force OpenMP to produce SSE SIMD instructions?

Test           4 Byte   Ops/  Repeat   Seconds  MFLOPS     First    All
                Words   Word  Passes                     Results   Same

OpenMP         100000      8    2500  0.333250    6002   0.957117   Yes
1 Thread      1000000      8     250  0.325936    6136   0.995517   Yes
             10000000      8      25  0.327862    6100   0.999549   Yes

OpenMP         100000      8    2500  0.086657   23079   0.957117   Yes
4 Threads     1000000      8     250  0.082454   24256   0.995517   Yes
             10000000      8      25  0.083116   24063   0.999549   Yes


Qpar           100000      8    2500  0.081774   24458   0.957117   Yes
1 Thread      1000000      8     250  0.083037   24086   0.995517   Yes
             10000000      8      25  0.101802   19646   0.999549   Yes

Qpar           100000      8    2500  0.023911   83644   0.957117   Yes
4 Threads     1000000      8     250  0.020935   95535   0.995517   Yes
             10000000      8      25  0.050972   39237   0.999549   Yes
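
The MFLOPS column and the 6.12 MFLOPS per MHz per core figure can be cross-checked from the other columns. The sketch below assumes MFLOPS is words x ops per word x passes / seconds / 10^6, which reproduces the table values to within rounding. The function names are hypothetical.

```c
/* MFLOPS from the table columns: total floating point operations
   divided by elapsed seconds, scaled to millions. */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    return words * ops_per_word * passes / seconds / 1.0e6;
}

/* Efficiency figure quoted in the question: MFLOPS per MHz per core. */
double mflops_per_mhz_per_core(double total_mflops, double mhz, int cores)
{
    return total_mflops / mhz / cores;
}
```

For example, mflops(100000, 8, 2500, 0.333250) is about 6002, matching the first OpenMP row, and mflops_per_mhz_per_core(95535, 3900, 4) is about 6.12, the best Qpar result at 3.9 GHz on 4 cores.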

The original Linux version obtained similar speeds to Windows, but recompiling with gcc 4.8.2 under Ubuntu 14.04 gave much faster results, though still not as good as Qpar. The gcc AVX option then made up the difference. See below:

All OpenMP Linux

Test           4 Byte   Ops/  Repeat   Seconds  MFLOPS     First    All
                Words   Word  Passes                     Results   Same

Old            100000      8    2500  0.326685    6122   0.957117   Yes
1 Thread      1000000      8     250  0.325421    6146   0.995517   Yes
             10000000      8      25  0.328084    6096   0.999549   Yes

Old            100000      8    2500  0.088871   22505   0.957117   Yes
4 Threads     1000000      8     250  0.085748   23324   0.995517   Yes
Data in & o  10000000      8      25  0.086515   23117   0.999549   Yes


New            100000      8    2500  0.151160   13231   0.957117   Yes
1 Thread      1000000      8     250  0.149263   13399   0.995517   Yes
SSE          10000000      8      25  0.156914   12746   0.999549   Yes

New            100000      8    2500  0.043920   45537   0.957117   Yes
4 Threads     1000000      8     250  0.039289   50905   0.995517   Yes
SSE          10000000      8      25  0.053432   37431   0.999549   Yes


New            100000      8    2500  0.075476   26499   0.957117   Yes
1 Thread      1000000      8     250  0.073838   27086   0.995517   Yes
AVX          10000000      8      25  0.096666   20690   0.999549   Yes

New            100000      8    2500  0.022043   90734   0.957117   Yes
4 Threads     1000000      8     250  0.019575  102169   0.995517   Yes
AVX          10000000      8      25  0.052228   38294   0.999549   Yes

It's not surprising that MSVC's auto-vectorization might not work with OpenMP. MSVC only supports OpenMP 2.0 (the latest version is 4.0), which came out over a decade ago. On top of that, MSVC's auto-vectorization is rather new itself. Microsoft wants you to use their proprietary tools (e.g. Qpar) and lock you into their system, which does not interface well with other systems. If your goal is optimal code, maybe you should consider GCC. It's free both in terms of freedom and in terms of free beer. — Z boson, Jun 26 '14 at 13:53
If you want SIMD code with OpenMP and MSVC then use intrinsics and vectorize it yourself (that's what I do). — Z boson, Jun 26 '14 at 13:55
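
A rough sketch of what hand vectorizing the test loop with SSE intrinsics under OpenMP could look like. This is an illustration only, not the commenter's code. The function name `triad_sse` is hypothetical, the data is assumed to be float, and unaligned loads are used so no alignment of x is assumed.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, _mm_mul_ps, ... */

/* Hypothetical hand-vectorized version of the test loop: 4 floats per iteration. */
void triad_sse(float *x, int n, float a, float b, float c,
               float d, float e, float f)
{
    const __m128 va = _mm_set1_ps(a);
    const __m128 vb = _mm_set1_ps(b);
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vd = _mm_set1_ps(d);
    const __m128 ve = _mm_set1_ps(e);
    const __m128 vf = _mm_set1_ps(f);
    const int nvec = n - (n % 4);  /* largest multiple of 4 not exceeding n */
    int i;

    /* Threads split the vector iterations; each works on its own 4 elements. */
    #pragma omp parallel for
    for (i = 0; i < nvec; i += 4)
    {
        __m128 v = _mm_loadu_ps(&x[i]);
        __m128 r = _mm_mul_ps(_mm_add_ps(v, va), vb);
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_add_ps(v, vc), vd));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_add_ps(v, ve), vf));
        _mm_storeu_ps(&x[i], r);
    }

    /* Scalar tail for the last n % 4 elements. */
    for (i = nvec; i < n; i++)
    {
        x[i] = (x[i]+a)*b - (x[i]+c)*d + (x[i]+e)*f;
    }
}
```

Because the SIMD instructions are written out explicitly, the result no longer depends on whether the compiler auto-vectorizes the body of an OpenMP loop.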
