
I am using Intel TBB parallel_for to speed up a for loop doing some calculations:

tbb::parallel_for(tbb::blocked_range<int>(0,ListSize,1000),Calc);

Calc is an object of the class DoCalc:

class DoCalc
{
    std::vector<std::string> FileList;
public:
    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            //Do some calculations
        }
    }
    DoCalc(const std::vector<std::string>& ilist) : FileList(ilist) {}
};

It takes approx. 60 seconds with the standard serial form of the for loop and approx. 20 seconds with TBB's parallel_for. With the standard for, the load on each core of my i5 CPU is at approx. 15% (according to the Windows Task Manager) and very inhomogeneous; with parallel_for it is at approx. 50% and very homogeneous.

I wonder if it's possible to get an even higher core load when using parallel_for. Are there any other parameters besides grain_size? How can I boost the speed of parallel_for without changing the operations within the loop body (marked as //Do some calculations in the code sample above)?

Chris Seymour
marc

3 Answers


The grainsize parameter is optional. When grainsize is not specified, a partitioner should be supplied to the algorithm template. A partitioner is an object that guides the chunking of a range. The auto_partitioner heuristically chooses the grain size for you, so that you do not have to specify one. The heuristic attempts to limit overhead while still providing ample opportunities for load balancing.

See the TBB website for more information: www.threadingbuildingblocks.org

Eugene Roeder

The answer to your question also depends on the ratio between memory accesses and computation in your algorithm. If you do very few operations on a lot of data, your problem is memory bound and that will limit the core load. If on the other hand you compute a lot with little data, your chances of improving are better.
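To make that ratio concrete, here is a plain C++ sketch (the function names and kernels are illustrative, not from the question) contrasting a loop with very little work per byte loaded against one with a lot:

```cpp
#include <cmath>
#include <vector>

// Memory-bound kernel: roughly one add per 8 bytes read from memory.
// Extra cores mostly end up waiting on RAM, so core load stays low.
double touch_once(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}

// Compute-bound kernel: many floating-point operations per element read.
// This kind of loop scales much better with the number of cores.
double heavy_math(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += std::sin(v[i]) * std::cos(v[i]);
    return s;
}
```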


As @Eugene Roeder already suggested, you might want to use the auto_partitioner (which is the default since TBB version 2.2) to automatically chunk the range:

tbb::parallel_for(tbb::blocked_range<int>(0,ListSize), Calc, tbb::auto_partitioner());

I assume that your i5 CPU has 4 cores, so you get a speedup of 3 (60s => 20s), which is already quite nice given that parallelization always carries some overhead. One problem could be that the memory bandwidth of your CPU is already saturated by 3 threads - or you might have a lot of allocations/deallocations within your loop which are synchronized between the threads by the standard memory manager. One trick to tackle this problem without many code changes in the inner loop is to use a thread-local allocator, e.g. for FileList:

std::vector<std::string, tbb::scalable_allocator<std::string>> FileList;

Note that you should try the tbb::scalable_allocator for all other containers used in the loop too, in order to bring your parallelization speedup closer to the number of cores, 4.

eci