
I was experimenting with the piece of code below to compare the performance of a serial loop and `parallel_for` (both the non-lambda and the lambda form).

#include <iostream>
#include <chrono>
#include <ctime>
#include <fstream>
#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"
#define MAX 10000000

using namespace std;
using namespace tbb;
void squarecalc(int a)
{
    a *= a;
}
void serial_apply_square(int* a)
{
    for (int i = 0; i < MAX; i++)
        squarecalc(*(a + i));
}

class apply_square
{
    int* my_a;
public:
    void operator()(const blocked_range<size_t>& r) const
    {
        int *a = my_a;
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    apply_square(int* a) :my_a(a){}
};
void parallel_apply_square(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), apply_square(a));
}
void parallel_apply_square_lambda(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n),
        [=](const blocked_range<size_t>& r)
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
                squarecalc(a[i]);
        });
}

int main()
{
    std::chrono::time_point<std::chrono::system_clock> start, end;
    int i = 0;
    int* a = new int[MAX];

    ifstream in("newfile"); // input file holding at least MAX integers
    while (i < MAX)
    {
        in >> a[i];
        i++;
    }

    start = std::chrono::system_clock::now();
    serial_apply_square(a);
    end = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end - start;
    cout << "\nTime for serial execution  :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square(a, MAX);
    end = std::chrono::system_clock::now();

    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [without lambda]  :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square_lambda(a, MAX);
    end = std::chrono::system_clock::now();

    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [with lambda] :" << elapsed_seconds.count() << endl;
    delete[] a; // allocated with new[], so it must be released with delete[]
}

In short, it just computes the square of 10000000 numbers in both serial and parallel ways. Below is the output I got from multiple executions of the compiled binary.

**1st execution**

Time for serial execution  :0.043183

Time for parallel execution [without lambda]  :0.035238

Time for parallel execution [with lambda]  :0.036719

**2nd execution**

Time for serial execution  :0.043252

Time for parallel execution [without lambda]  :0.035403

Time for parallel execution [with lambda]  :0.036811

**3rd execution**

Time for serial execution  :0.043241

Time for parallel execution [without lambda]  :0.035355

Time for parallel execution [with lambda]  :0.036558

**4th execution**

Time for serial execution  :0.043216

Time for parallel execution [without lambda]  :0.035491

Time for parallel execution [with lambda]  :0.036697

Though the parallel execution times are lower than the serial execution time in all cases, I was curious why the lambda version's time is higher than that of the other parallel version, where the body object is written by hand.

  • Why is the lambda version always taking more time?
  • Is it because of the overhead for the compiler to create its own body object? (See the sketch after this list.)
  • If the answer to the above question is yes, is the lambda version inferior to the self-written version?
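
For reference, a lambda is essentially shorthand for a compiler-generated function object: the compiler emits an unnamed class with the captured pointer as a member and the lambda body as its `operator()`. A rough sketch of what the lambda above desugars to (the name `lambda_equivalent` is made up for illustration; the real compiler-generated class is unnamed):

// Approximate desugaring of [=](const blocked_range<size_t>& r) { ... }
class lambda_equivalent
{
    int* a; // captured by value, as [=] copies the pointer
public:
    lambda_equivalent(int* a_) : a(a_) {}
    void operator()(const blocked_range<size_t>& r) const
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
};

Structurally this is the same as the hand-written `apply_square`, so the closure object itself should add no inherent overhead.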

**Edit**

Below are the results for the optimized code (compiled with -O2):

**1st execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.00055

Time for parallel execution [with lambda]  :1e-05

**2nd execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.000583

Time for parallel execution [with lambda]  :9e-06

**3rd execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.000554

Time for parallel execution [with lambda]  :9e-06

Now the optimized code seems to show better results for the serial part, and the time for the lambda part has improved.

Does this mean that parallel code performance always needs to be tested with optimized code?

sjsam
  • What compiler, and what optimization level do you use? Are you aware that any decent compiler should replace a call to `squarecalc` with a no-op, as you don't pass the argument by reference, but by value? – MikeMB Apr 04 '15 at 07:48
  • `squarecalc` is intentionally designed to take its argument by value, as my main objective was to measure the time; I am least interested in the actual squares. I am using g++ version 4.6.3 and applied no optimizations – sjsam Apr 04 '15 at 08:08
  • Sorry, but in that case I can't help you. Arguing about the performance of unoptimized code is just pointless. And as I said: with optimized code you would probably only measure overhead anyway (`squarecalc` would e.g. never be called). – MikeMB Apr 04 '15 at 08:24
  • @MikeMB: I have recompiled with optimization level 2 and the results change dramatically. Please see the edit. – sjsam Apr 04 '15 at 08:26

1 Answer


Does this mean that parallel code performance always need to be tested with optimized code?

Any code's performance has to be tested with optimized code. Do you want to optimize your code for fast runtimes during debugging, or for when your software actually gets used?

The main problem with your code is that your loops don't do any work (`squarecalc`, and most probably even `serial_apply_square`, get optimized away completely), so the measured times are far too short to serve as an indicator of the real-life performance of the different constructs.
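
One way to make such a benchmark meaningful under -O2 is to give the loop an observable effect, e.g. by taking the argument by reference and consuming the results after the timed region. A minimal sketch under that assumption (the `checksum` step is illustrative, not part of the original code):

// Squaring in place gives the optimizer a side effect it must keep.
inline void squarecalc(int& a)
{
    a *= a;
}

void serial_apply_square(int* a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        squarecalc(a[i]);
}

// After the timed region, consume the results so the loop cannot be
// removed as dead code:
long long checksum(const int* a, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// ... run and time serial_apply_square(a, MAX) as before, then e.g.:
// std::cout << "checksum: " << checksum(a, MAX) << '\n';

With the results actually used, the compiler may still vectorize the loops, but it can no longer delete them, so the serial and parallel timings compare real work.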

MikeMB