I have a series of benchmarks that carry out the same calculations via CUDA, multiple threads and OpenMP, currently being tested on Windows 8.1. The threaded program needed MS compiler version 18.00 for x64, from Visual Studio 2013, to reach full SSE SIMD speed, but that compiler did not do the same for AVX. See:
How can I improve performance compiling for SSE and AVX?
The OpenMP version produced the slower SISD speed (see the link above) with an earlier compiler, and there was no improvement with version 18.00. Then I discovered /Qpar, the MS Auto-Parallelizer, which produced full-speed SIMD code via VS 2013. One of the test loops is shown below, along with the pragma directives and compile options used.
    #pragma ??
    for(i=0; i < n; i++)
    {
        x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
    }
OpenMP - #pragma omp parallel for        Compile option /openmp
Qpar   - #pragma loop(hint_parallel(4))  Compile option /Qpar

The Qpar thread count has to be a constant, so the program repeats the calculation code with a different count value for each of the run time parameters 1, 2, 4, 8 and 16.
The OpenMP runs use the /Affinity command to select 1 core (or more).
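As a rough sketch of how the two variants might be laid out in source, the block below wraps the test loop in one OpenMP function and in a set of Qpar functions, one per constant thread count, with a run time switch choosing between them. The names triad_openmp, triad_qpar, QPAR_HINT and DEFINE_QPAR_LOOP are made up for illustration, the float type is assumed from the 4 byte words, and this is not the benchmark's actual code.

```c
/* Illustrative sketch only: names and the float type are assumptions. */

/* Emit the MSVC auto-parallelizer hint only under the Microsoft compiler;
   other compilers see an empty macro and build a plain serial loop. */
#if defined(_MSC_VER)
#define QPAR_HINT(n) __pragma(loop(hint_parallel(n)))
#else
#define QPAR_HINT(n)
#endif

/* OpenMP variant: built with /openmp (MSVC) or -fopenmp (gcc).
   Without that option the pragma is ignored and the loop runs serially. */
void triad_openmp(float *x, int n, float a, float b,
                  float c, float d, float e, float f)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}

/* Qpar variant: hint_parallel() needs a compile-time constant, so one copy
   of the loop is generated per supported thread count. */
#define DEFINE_QPAR_LOOP(count)                                           \
    static void triad_qpar_##count(float *x, int n, float a, float b,     \
                                   float c, float d, float e, float f)    \
    {                                                                     \
        int i;                                                            \
        QPAR_HINT(count)                                                  \
        for (i = 0; i < n; i++)                                           \
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;      \
    }

DEFINE_QPAR_LOOP(1)
DEFINE_QPAR_LOOP(2)
DEFINE_QPAR_LOOP(4)
DEFINE_QPAR_LOOP(8)
DEFINE_QPAR_LOOP(16)

/* Pick the loop copy matching the run time parameter.
   Returns 0 on success, -1 if the thread count has no compiled copy. */
int triad_qpar(float *x, int n, int threads, float a, float b,
               float c, float d, float e, float f)
{
    switch (threads) {
    case 1:  triad_qpar_1(x, n, a, b, c, d, e, f);  return 0;
    case 2:  triad_qpar_2(x, n, a, b, c, d, e, f);  return 0;
    case 4:  triad_qpar_4(x, n, a, b, c, d, e, f);  return 0;
    case 8:  triad_qpar_8(x, n, a, b, c, d, e, f);  return 0;
    case 16: triad_qpar_16(x, n, a, b, c, d, e, f); return 0;
    default: return -1;
    }
}
```

The duplication through a macro is just one way to satisfy the constant thread count requirement; writing the five loops out by hand would behave the same.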
Results on a quad core i7 4820K with 10 MB L3 cache, running at 3.9 GHz, are below. The 400 KB and 4 MB tests mainly run from L3 cache, but the largest data size (40 MB) is influenced by RAM speed. Maximum Qpar speed is quite good at 6.12 MFLOPS per MHz per core, out of a maximum of 8.
Are there other compile options or directives that will force OpenMP to produce SSE SIMD instructions?
Test          4 Byte   Ops/  Repeat    Seconds   MFLOPS     First   All
               Words   Word  Passes                       Results  Same

OpenMP        100000      8    2500   0.333250     6002  0.957117   Yes
1 Thread     1000000      8     250   0.325936     6136  0.995517   Yes
            10000000      8      25   0.327862     6100  0.999549   Yes

OpenMP        100000      8    2500   0.086657    23079  0.957117   Yes
4 Threads    1000000      8     250   0.082454    24256  0.995517   Yes
            10000000      8      25   0.083116    24063  0.999549   Yes

Qpar          100000      8    2500   0.081774    24458  0.957117   Yes
1 Thread     1000000      8     250   0.083037    24086  0.995517   Yes
            10000000      8      25   0.101802    19646  0.999549   Yes

Qpar          100000      8    2500   0.023911    83644  0.957117   Yes
4 Threads    1000000      8     250   0.020935    95535  0.995517   Yes
            10000000      8      25   0.050972    39237  0.999549   Yes
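For anyone checking the figures, the MFLOPS column appears to follow from the other columns as words x ops per word x passes / seconds, and the 6.12 figure is the best Qpar result divided by clock speed and core count. The small sketch below shows that arithmetic; the formula is inferred from the table rather than taken from the benchmark source, and the function names are hypothetical.

```c
/* Inferred from the table columns, not from the benchmark source:
   total floating point operations divided by elapsed time, in millions. */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    return words * ops_per_word * passes / seconds / 1.0e6;
}

/* Normalise a measured rate by clock speed (MHz) and number of cores. */
double mflops_per_mhz_per_core(double mflops_total, double clock_mhz, int cores)
{
    return mflops_total / clock_mhz / cores;
}
```

With the first OpenMP row, mflops(100000, 8, 2500, 0.333250) comes out at about 6002, and mflops_per_mhz_per_core(95535, 3900, 4) gives about 6.12, matching the numbers quoted above.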
The original Linux version obtained similar speeds to Windows, but recompiling with gcc 4.8.2 under Ubuntu 14.04 gave much faster results, though still not as good as Qpar. The gcc AVX option then made up the difference. See below:
All OpenMP Linux

Test          4 Byte   Ops/  Repeat    Seconds   MFLOPS     First   All
               Words   Word  Passes                       Results  Same

Old           100000      8    2500   0.326685     6122  0.957117   Yes
1 Thread     1000000      8     250   0.325421     6146  0.995517   Yes
            10000000      8      25   0.328084     6096  0.999549   Yes

Old           100000      8    2500   0.088871    22505  0.957117   Yes
4 Threads    1000000      8     250   0.085748    23324  0.995517   Yes
Data in & o 10000000      8      25   0.086515    23117  0.999549   Yes

New           100000      8    2500   0.151160    13231  0.957117   Yes
1 Thread     1000000      8     250   0.149263    13399  0.995517   Yes
SSE         10000000      8      25   0.156914    12746  0.999549   Yes

New           100000      8    2500   0.043920    45537  0.957117   Yes
4 Threads    1000000      8     250   0.039289    50905  0.995517   Yes
SSE         10000000      8      25   0.053432    37431  0.999549   Yes

New           100000      8    2500   0.075476    26499  0.957117   Yes
1 Thread     1000000      8     250   0.073838    27086  0.995517   Yes
AVX         10000000      8      25   0.096666    20690  0.999549   Yes

New           100000      8    2500   0.022043    90734  0.957117   Yes
4 Threads    1000000      8     250   0.019575   102169  0.995517   Yes
AVX         10000000      8      25   0.052228    38294  0.999549   Yes