Recently I've been looking at using TBB rather than boost.threads to speed up development. Generally the parallel_for works in most cases but I have a situation here that is a bit more complex.
There is an array of structs that needs computation that has been sorted according to a member variable. This is because the variables value relates to data that will be accessed during computation and grouping structs according to this will allow for cache coherency in a serial design.
#include <tbb/tbb.h>
#include <iostream>
struct thing
{
float value_one;
float value_two;
unsigned int sort_id;
};
class functor
{
thing* m_array;
public:
functor(thing* _array) : m_array(_array) {;}
void operator()(const tbb::blocked_range<unsigned int>& r) const
{
for(int i = r.begin(); i != r.end(); ++i)
{
//Doing a computation with array
m_array[i].value_one = m_array[i].value_two * m_array[i].value_two;
}
}
};
int main(int argc, char const *argv[])
{
unsigned int n = 10;
thing* array = new thing[n];
// Note the ordered id groups
array[0].sort_id = 1;
array[1].sort_id = 1;
array[2].sort_id = 1;
array[3].sort_id = 2;
array[4].sort_id = 3;
array[5].sort_id = 5;
array[6].sort_id = 5;
array[7].sort_id = 9;
array[8].sort_id = 9;
array[9].sort_id = 9;
// Do something parallel with array here...
// parallel_for(tbb::blocked_range<unsigned int>(0, n, 2), functor(array));
delete[] array;
return 0;
}
A simplified example is given above but in reality I'll most likely have an array of around 30-60 million elements.
I understand that parallel_for will divide the array into grouped ranges. However I would like each range to contain all the structs of a particular id. I don't mind if the range contains structs of multiple ids as long as they are sequential and contain all structs of both those ids.
int count = 0;
thing** blocks = new thing*[7];
int* sizes = new int[7];
int current_id = 0;
for(unsigned int i = 0; i < n; ++i)
{
if(array[i].sort_id != current_id)
{
current_id = array[i].sort_id;
blocks[count] = &array[i];
sizes[count] = 1;
++count;
}
else
{
sizes[count - 1] += 1;
}
}
parallel_for(tbb::blocked_range<unsigned int>(0, count, 2), functor(blocks, sizes));
Should I somehow divide the array into smaller chunks pointed to by another array that is then parallelized (as in the code directly above), and if so, what would be an efficient way to do this or is the example given optimal? Is there an alternative to parallel_for (such as task_group) that would be better suited to this problem?