Strange execution time in Debug and Release versions

Question

I started to play with the Parallel Patterns Library in VS2010. The application gives me the expected results, but when I benchmark the Debug and Release builds, I get strange execution times in the Release build:

Debug version:
"Sequential Duration : 1014"
"Parallel Duration : 437"

Release version:
"Sequential Duration : 31"
"Parallel Duration : 484"

This is my application code:

double DoWork(int workload)
{
    double result=0;
    for(int i =0 ; i < workload;i++)
    {
        result +=sqrt((double)i * 4*3) + i* i;
    }
    return result;
}

vector<double> Seqential()
{
    vector<double> results(100);
    for(int i = 0 ; i <100 ; i++)
    {
        results[i] = DoWork(1000000);
    }

    return results;
}

vector<double> Parallel()
{
     vector<double> results(100);
     parallel_for(0,(int)100,1,[&results](int i)
     {
         results[i] = DoWork(1000000);
     });

     return results;
}

double Sum(const vector<double>& results)
{
    double result =0;
    for(int i = 0 ; i < results.size();i++)
        result += results[i];
    return result;
}

int main()
{
    DWORD start = GetTickCount();
    vector<double> results = Seqential();
    DWORD duration = GetTickCount() - start;
    cout<<"Sequential Duration : "<<duration <<"  Result : " <<Sum(results) << endl;

    start = GetTickCount();
    results = Parallel();
    duration = GetTickCount() - start;
    cout<<"Prallel Duration : "<<duration <<"  Result : " <<Sum(results) << endl;
    system("PAUSE");
    return 0;
}

score 2 · Answer 1 · answered Mar 03 '12 at 05:04


IIRC, C++11 allows the compiler to go pretty deep into functions to precompute constant expressions at compile time, even functions like sqrt. So your Sequential version could be getting optimized all the way to a table of results. You might want to look at the generated assembly for Sequential, if possible, and see if it looks overly simplified, or possibly optimized away entirely.

There's nothing about DoWork that can't be computed at compile time.

answered Mar 03 '12 at 05:04

Mike DeSimone


Adding a `volatile` qualifier to `workload` or `result` may help. – J.N. Mar 03 '12 at 05:08
Adding `volatile` to `i` would make the most difference, if the compiler doesn't call your bluff. – Mike DeSimone Mar 03 '12 at 12:04
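
A minimal sketch of the `volatile` idea from the comments above, not the commenters' actual code. The names `SeqentialVolatile`, `count` and `workload_value` are hypothetical. The workload is read through a `volatile` variable on every iteration, so the compiler should not be able to assume that every call to DoWork receives the same argument and collapse the loop into a single call. One deliberate deviation from the question's code: `i` is converted to `double` before squaring, because `i * i` overflows `int` long before `i` reaches 1000000.

```cpp
#include <cmath>
#include <vector>

// Same loop as DoWork in the question, except that i is converted to double
// before squaring so that the product cannot overflow int.
double DoWork(int workload)
{
    double result = 0;
    for (int i = 0; i < workload; i++)
    {
        result += std::sqrt((double)i * 4 * 3) + (double)i * i;
    }
    return result;
}

// Hypothetical variant of Seqential(): the workload lives in a volatile
// variable, so it has to be re-read on every iteration.
std::vector<double> SeqentialVolatile(int count, int workload_value)
{
    volatile int workload = workload_value;
    std::vector<double> results(count);
    for (int i = 0; i < count; i++)
    {
        results[i] = DoWork(workload); // one volatile read per call
    }
    return results;
}
```

Calling `SeqentialVolatile(100, 1000000)` in place of `Seqential()` keeps the benchmark otherwise unchanged. Whether a given optimizer still finds a shortcut is best confirmed by inspecting the generated assembly, as the answer suggests.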

score 1 · Answer 2 · answered Mar 03 '12 at 05:01


What's likely happening is that the overhead of spawning multiple threads is taking more time than simply computing the results. In a Release build, the compiler is able to do a lot of optimization, so the amount of work being done inside DoWork is considerably smaller compared to the amount of work it takes to set up a thread and tear it down.

If you make DoWork do much more work (such as by looping many times), you'll see results that match your expectations more closely.

answered Mar 03 '12 at 05:01

Adam Rosenfield


Yep. His code measures the cost of thread creation which, not surprisingly, is much higher in a debug build than a release build. – David Schwartz Mar 03 '12 at 05:07

@David Schwartz: Huh? The threaded Parallel run takes *longer* to run in Release than in Debug; it's like it de-optimizes in Release by almost 20%. The Sequential run uses no threads and runs 30 times faster in Release than Debug, a more expected result. – Mike DeSimone Mar 03 '12 at 12:01

Increasing the work doesn't change the relative performance difference. Also, the number of threads is small (theoretically at most 100, but actually equal to your number of cores in practice), too small for thread setup/tear-down to matter. – Branko Dimitrijevic Mar 03 '12 at 12:50

score 1 · Accepted Answer · answered Mar 03 '12 at 15:04

The problem is not that Parallel is slow but that Seqential is too fast:

In Seqential, the compiler sees that DoWork will always produce the same result, so the loop calling it 100 times is optimized away and DoWork ends up being called only once.
The compiler is not clever enough to optimize the parallel_for in quite the same way, so it ends up doing the actual work (100 times more actual work, in fact).

If you make DoWork depend on the loop counter, different calls produce different results, so no call is redundant and there is nothing for the compiler to optimize away.

For example:

#include <vector>
#include <iostream>
#include <math.h>
#include <ppl.h>
#include <Windows.h>

using namespace std;
using namespace Concurrency;

double DoWork(int workload, int outer_i)
{
    double result=0;
    for(int i =0 ; i < workload;i++)
    {
        result +=sqrt((double)i * 4*3) + i* i;
    }
    result += outer_i;
    return result;
}

vector<double> Seqential()
{
    vector<double> results(100);
    for(int i = 0 ; i <100 ; i++)
    {
        results[i] = DoWork(1000000, i);
    }

    return results;
}

vector<double> Parallel()
{
    vector<double> results(100);
    parallel_for(0,(int)100,1,[&results](int i)
    {
        results[i] = DoWork(1000000, i);
    });

    return results;
}

double Sum(const vector<double>& results)
{
    double result =0;
    for(int i = 0 ; i < results.size();i++)
        result += results[i];
    return result;
}

int main()
{
    DWORD start = GetTickCount();
    vector<double> results = Seqential();
    DWORD duration = GetTickCount() - start;
    cout<<"Sequential Duration : "<<duration <<"  Result : " <<Sum(results) << endl;

    start = GetTickCount();
    results = Parallel();
    duration = GetTickCount() - start;
    cout<<"Prallel Duration : "<<duration <<"  Result : " <<Sum(results) << endl;
    system("PAUSE");
    return 0;
}

When built by Visual C++ 2010 under Release configuration and run on a quad-core CPU, this prints:

Sequential Duration : 1607  Result : 1.68692e+015
Prallel Duration : 374  Result : 1.68692e+015

(BTW, you should really consider formatting your code better.)

Thanks for the info. Now my Parallel time is about half of the Sequential time, but I expected it to be almost four times faster than Sequential on an i5 processor. Also, please tell me your notes about my code formatting. – Ma7moud El-Naggar Mar 03 '12 at 16:11
@Ma7moudEl-Naggar Some i5 processors have 2 cores with 4 hyper-threads, which is different from "full" 4 cores. Do you happen to have such a processor? BTW, the above measurements were done on a Core 2 Quad at 2.5 GHz. – Branko Dimitrijevic Mar 03 '12 at 17:01
@Ma7moudEl-Naggar Regarding code formatting, you should start by indenting your code properly. It also wouldn't hurt to use whitespace consistently... – Branko Dimitrijevic Mar 03 '12 at 17:03
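
To put numbers on the speedup discussed in these comments, here is a small sketch. The helper names `Speedup`, `Efficiency` and `LogicalProcessors` are hypothetical, and it assumes a compiler that ships the C++11 `<thread>` header, which may mean something newer than the VS2010 used in the question. With the durations printed in the answer, 1607 / 374 is roughly 4.3 on a quad-core machine, while a speedup near 2 would be consistent with two physical cores. Note that `std::thread::hardware_concurrency()` is only a hint and typically counts logical processors, so hyper-threads are included in its result.

```cpp
#include <thread>

// Ratio of sequential to parallel wall-clock time (same unit for both).
double Speedup(double sequential_ms, double parallel_ms)
{
    return sequential_ms / parallel_ms;
}

// Speedup per worker; 1.0 would correspond to perfect scaling.
double Efficiency(double speedup, unsigned workers)
{
    return speedup / workers;
}

// Number of concurrent hardware threads reported by the standard library.
// This is a hint only and may be 0 if the value cannot be determined.
unsigned LogicalProcessors()
{
    return std::thread::hardware_concurrency();
}
```

Dividing a measured speedup by the number of physical cores rather than by `LogicalProcessors()` gives a fairer picture on hyper-threaded CPUs, which is the distinction raised in the comment above.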
