I was experimenting with the code below to compare the performance of a serial loop against tbb::parallel_for, in both its non-lambda and lambda forms.
#include <iostream>
#include <chrono>
#include <fstream>

#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"

#define MAX 10000000

using namespace std;
using namespace tbb;

// Squares its argument. Note that the parameter is taken by value,
// so the result is discarded and the array itself is never modified.
void squarecalc(int a)
{
    a *= a;
}

// Plain serial loop over the whole array.
void serial_apply_square(int* a)
{
    for (int i = 0; i < MAX; i++)
        squarecalc(*(a + i));
}

// Hand-written body object for tbb::parallel_for.
class apply_square
{
    int* my_a;
public:
    apply_square(int* a) : my_a(a) {}

    void operator()(const blocked_range<size_t>& r) const
    {
        int* a = my_a;
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
};

void parallel_apply_square(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), apply_square(a));
}

// The same loop, with the body expressed as a lambda instead.
void parallel_apply_square_lambda(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n),
        [=](const blocked_range<size_t>& r)
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
                squarecalc(a[i]);
        });
}

int main()
{
    std::chrono::time_point<std::chrono::system_clock> start, end;

    // Read MAX integers from the input file.
    int i = 0;
    int* a = new int[MAX];
    fstream of;
    of.open("newfile", ios::in);
    while (i < MAX)
    {
        of >> a[i];
        i++;
    }

    start = std::chrono::system_clock::now();
    serial_apply_square(a);
    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    cout << "\nTime for serial execution :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square(a, MAX);
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [without lambda] :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square_lambda(a, MAX);
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [with lambda] :" << elapsed_seconds.count() << endl;

    delete[] a;   // a was allocated with new[], so free() would be undefined behaviour
}
In short, it just computes the square of 10000000 numbers, once serially and twice in parallel. Below is the output I got from multiple executions of the compiled binary.
**1st execution**
Time for serial execution :0.043183
Time for parallel execution [without lambda] :0.035238
Time for parallel execution [with lambda] :0.036719
**2nd execution**
Time for serial execution :0.043252
Time for parallel execution [without lambda] :0.035403
Time for parallel execution [with lambda] :0.036811
**3rd execution**
Time for serial execution :0.043241
Time for parallel execution [without lambda] :0.035355
Time for parallel execution [with lambda] :0.036558
**4th execution**
Time for serial execution :0.043216
Time for parallel execution [without lambda] :0.035491
Time for parallel execution [with lambda] :0.036697
Though the parallel execution times are lower than the serial execution time in all cases, I was curious why the lambda version is consistently slower than the other parallel version, where the body object is written by hand.
- Why does the lambda version always take more time?
- Is it because of the overhead for the compiler to create its own body object? (See the sketch after this list for what I assume that object looks like.)
- If the answer to the above question is yes, is the lambda version inferior to the hand-written version?
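As far as I understand, the lambda is just shorthand for a compiler-generated closure type that should end up looking almost exactly like my hand-written apply_square. A rough sketch of what I assume the compiler produces for the lambda in parallel_apply_square_lambda (lambda_closure is a made-up name for illustration only):

// Rough sketch of the closure type I assume the compiler generates for the
// lambda; "lambda_closure" is a made-up name.
class lambda_closure
{
    int* a;   // [=] captures the pointer by value, not the array contents
public:
    explicit lambda_closure(int* a_) : a(a_) {}

    // const call operator, just like the hand-written apply_square::operator()
    void operator()(const blocked_range<size_t>& r) const
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
};

If that is roughly what happens, I don't see where a systematic difference between the two parallel versions would come from.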
**Edit**
Below are the results for the optimized code (compiled with -O2):
**1st execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.00055
Time for parallel execution [with lambda] :1e-05
**2nd execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.000583
Time for parallel execution [with lambda] :9e-06
**3rd execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.000554
Time for parallel execution [with lambda] :9e-06
Now the optimized code seems to show better results for the serial part, and the time for the lambda part has improved as well.
Does this mean that parallel code performance always needs to be tested with optimized code?
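One thing I suspect about these numbers: since squarecalc takes its argument by value and the result is never used, the optimizer is free to drop the whole loop, which would explain the serial time of 0. If I wanted the work to survive -O2, I assume something along the following lines would be needed (squarecalc_used and the checksum are made up for this sketch; the point is that the result is accumulated and printed, so it stays observable):

// Sketch: return the square and accumulate it, so the optimizer cannot
// discard the computation as dead code. Printing the sum keeps it observable.
long long squarecalc_used(int a)
{
    return static_cast<long long>(a) * a;
}

long long serial_apply_square_used(int* a)
{
    long long sum = 0;
    for (int i = 0; i < MAX; i++)
        sum += squarecalc_used(a[i]);
    return sum;
}

// In main():
//   long long checksum = serial_apply_square_used(a);
//   ...
//   cout << "checksum :" << checksum << endl;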