
I originally have a single-threaded loop that iterates over all the pixels of an image and may perform various operations on the data.

The library I am using dictates that retrieving pixels from an image must be done one line at a time. To this end I malloc a block of memory which can hold one row of pixels. (BMM_Color_fl is a struct containing one pixel's RGBA data as four float values, and GetLinearPixels() copies one row of pixels from a bitmap into a BMM_Color_fl array.)

BMM_Color_fl* line = (BMM_Color_fl*)malloc(width * sizeof(BMM_Color_fl));
for (int y = 0; y < height; y++)
{   
    bmp->GetLinearPixels(0, y, width, line); //Copy data of row Y from bitmap into line.
    BMM_Color_fl* pixel = line; //Get first pixel of line.
    for (int x = 0; x < width; x++, pixel++) // For each pixel in the row...
    {
        //Do stuff with a pixel.
    }
}
free(line);

So far so good!

To reduce the execution time of this loop, I have written a concurrent version using parallel_for, which looks like this:

parallel_for(0, height, [&](int y)
{   
    BMM_Color_fl* line = (BMM_Color_fl*)malloc(width * sizeof(BMM_Color_fl));
    bmp->GetLinearPixels(0, y, width, line);
    BMM_Color_fl* pixel = line;
    for (int x = 0; x < width; x++, pixel++)
    {
        //Do stuff with a pixel.
    }
    free(line);
});

While the multithreaded loop is already faster than the original, I realize that the threads cannot all safely use the same memory block, so I am currently allocating and freeing the memory in every loop iteration. This is obviously wasteful, as there will never be more threads than loop iterations.
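
To make the pitfall concrete, here is a sketch (not code from the original program) of the tempting variant that keeps a single shared buffer. Every task writes its row into the same block, so one thread's GetLinearPixels() call can overwrite the row another thread is still processing:

BMM_Color_fl* line = (BMM_Color_fl*)malloc(width * sizeof(BMM_Color_fl)); // one buffer shared by all tasks
parallel_for(0, height, [&](int y)
{
    bmp->GetLinearPixels(0, y, width, line); // DATA RACE: concurrent tasks overwrite each other's row
    //Do stuff with the pixels of row y...
});
free(line);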

My question is whether, and how, I can have each thread allocate exactly one line buffer, reuse it for every row it processes, and ideally free it at the end.

  • As a disclaimer I must state I am a novice C++ user.

Implementation of suggested solutions:

Concurrency::combinable<std::vector<BMM_Color_fl>> line;

parallel_for(0, height, [&] (int y)
{
    std::vector<BMM_Color_fl>& lineL = line.local(); // Take a reference, so this thread's vector is reused rather than copied.
    if (lineL.size() < (size_t)width) lineL.resize(width); // resize rather than reserve: indexing elements beyond size() is undefined.

    bmp->GetLinearPixels(0, y, width, &lineL[0]);

    for (int x = 0; x < width; x++)
    {
         BMM_Color_fl* pixel = &lineL[x];
         //Do stuff with a pixel.
    }       
});

As suggested, I canned the malloc and replaced it with a std::vector, taken by reference from the combinable and resized once per thread.

Rotem
  • Sorry but I can’t *not* complain: `malloc` has no place whatsoever in C++ code. Under no circumstance ever, unless you are in fact implementing a custom allocator in terms of `malloc` (but why you would ever do that is beyond me since the default allocator does that already). – Konrad Rudolph Feb 11 '12 at 15:56
  • @Konrad I'm not arguing, but could you please explain why? – Rotem Feb 11 '12 at 15:57
  • Because C++ specifically encapsulates the `malloc` / `free` mechanism in two different ways: (1) via `new`/`delete`, and (2) via allocators. Both offer a superior interface due to strong typing. This improves readability and avoids type errors. Furthermore, even those methods should generally (read: almost *always*) be avoided in favour of managed memory (in your case, a `std::vector` but more generally smart pointers) in order to avoid memory leaks and to simplify the code (no explicit clean-up needed). – Konrad Rudolph Feb 11 '12 at 16:00
  • @Konrad As 3 out of the 4 methods you mentioned are unknown to me, I see I have some reading to do. My logic for not using `new/delete` was that the values would be initialized and I was trying to avoid that overhead as I'm overwriting them anyways. Is this logic mistaken? – Rotem Feb 11 '12 at 16:02
  • You are indeed correct. In this case, an [allocator](http://www.cplusplus.com/reference/std/memory/allocator/) might help. But using a `std::vector` in conjunction with `reserve` is much less hassle and also doesn’t initialise the values. – Konrad Rudolph Feb 11 '12 at 16:35
  • @Konrad Thanks, I will definitely look into that, though as far as I can tell it still doesn't go towards solving the concurrency problem, right? Even when I use a `new` array and try to capture it inside the lambda by value (`[=]`) it doesn't work. – Rotem Feb 11 '12 at 16:40
  • What platform? Windows? Linux? Whose implementation of parallel_for? Since the curly braces imply scope, can you not use a smart pointer class and let the smart pointer worry about deleting the memory? – JimR Feb 11 '12 at 18:38 (a sketch of this appears after these comments)
  • @JimR - (1) Windows (2) Not sure what you mean, concurrency:parallel_for (3) I'm not familiar with smart pointers but I'll be sure to read up on it, thanks. – Rotem Feb 12 '12 at 16:57
  • @Konrad Thanks for all the advice, I've refactored it to use `std::vector` instead of `malloc`, added revised code in question. – Rotem Feb 12 '12 at 16:58
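
To illustrate the smart-pointer suggestion from the comments above, the per-iteration buffer could be owned by a std::unique_ptr, so it is released automatically when it goes out of scope. This is only a sketch using the names from the question (and it still allocates once per row, unlike the combinable version shown earlier); note that new BMM_Color_fl[width] without parentheses leaves the floats uninitialised, assuming BMM_Color_fl is a plain struct of four floats:

#include <memory> // std::unique_ptr (available on any compiler that already supports the lambdas used above)

parallel_for(0, height, [&](int y)
{
    std::unique_ptr<BMM_Color_fl[]> line(new BMM_Color_fl[width]); // freed automatically at the end of each iteration
    bmp->GetLinearPixels(0, y, width, line.get());
    for (int x = 0; x < width; x++)
    {
        BMM_Color_fl* pixel = &line[x];
        //Do stuff with a pixel.
    }
});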

2 Answers


Instead of having each thread call parallel_for(), have them call another function which allocates the memory, calls parallel_for(), and then frees the memory.

salsaman
  • I'm not sure what you mean. I am not manually starting any threads, I am calling parallel_for from my main thread, which automatically splits the loop iterations into separate threads. – Rotem Feb 11 '12 at 15:55
  • Yes, but as you said, you are doing the allocation inside the for loop. You want to put the malloc/free outside the for loop. You can do this by creating another function which mallocs, calls parallel_for(), and then frees the memory. Then call this for each thread. – salsaman Feb 11 '12 at 16:30
  • Sorry, I'm still not following. You're suggesting I create a bunch of new threads and have each of them call allocation/parallel_for/free? Wouldn't that cause each thread to spawn more threads inside the parallel_for? And wouldn't that still not work because each parallel_for would be wrongly using a single memory block for all threads? – Rotem Feb 11 '12 at 16:36
  • What I mean is: you can do this by creating another function which mallocs, calls an equivalent of parallel_for(), and then frees the memory. Then call this for each thread, i.e. do your own thread management rather than use parallel_for directly. – salsaman Feb 11 '12 at 16:37 (see the sketch after these comments)
  • So basically your original for loop in a wrapper. – salsaman Feb 11 '12 at 16:38
  • Ah I understand what you mean now. Yes, that is definitely the 'plan b' in this scenario but I was hoping there was a way to capture a copy of the array/memory block inside the lambda, and have parallel_for do the managing, as it does automatic load balancing. – Rotem Feb 11 '12 at 16:43
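
For reference, here is a rough sketch of what salsaman seems to describe: doing the thread management by hand so that each thread allocates exactly one line buffer and reuses it for a contiguous range of rows. The Bitmap parameter type, the even row split, and the use of std::thread are assumptions, and this approach gives up parallel_for's automatic load balancing:

#include <algorithm> // std::min
#include <thread>
#include <vector>

void ProcessRows(Bitmap* bmp, int width, int yBegin, int yEnd)
{
    std::vector<BMM_Color_fl> line(width); // one buffer per thread, reused for every row in the range
    for (int y = yBegin; y < yEnd; y++)
    {
        bmp->GetLinearPixels(0, y, width, &line[0]);
        for (int x = 0; x < width; x++)
        {
            BMM_Color_fl* pixel = &line[x];
            //Do stuff with a pixel.
        }
    }
} // the buffer is freed here, when `line` goes out of scope

void ProcessImage(Bitmap* bmp, int width, int height)
{
    unsigned count = std::thread::hardware_concurrency();
    if (count == 0) count = 1;
    int chunk = (height + (int)count - 1) / (int)count; // rows per thread, rounded up
    std::vector<std::thread> workers;
    for (int yBegin = 0; yBegin < height; yBegin += chunk)
        workers.emplace_back(ProcessRows, bmp, width, yBegin, std::min(height, yBegin + chunk));
    for (std::thread& t : workers)
        t.join();
}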

You can use the Concurrency::combinable class to achieve this. I am too lazy to post the code, but I am sure it is possible. (A minimal sketch appears after the comments below.)

Ajay
  • Thanks for the pointer. If I understand you correctly, I should be creating a `combinable` object of `width` length and calling its `.local()` inside the loop? And if so, would that be creating a copy of the array per thread or per loop iteration? – Rotem Feb 12 '12 at 11:31
  • nevermind, figured it out. Added solution code at the end of the question. – Rotem Feb 12 '12 at 16:36
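
Regarding the "ideally, free it at the end" part of the question: combinable keeps one vector per worker thread, and those vectors (and their memory) are released when the combinable object itself is destroyed, so letting it go out of scope is enough; it also has a clear() member for releasing the per-thread copies earlier. A minimal sketch, reusing the names from the revised code in the question:

{
    Concurrency::combinable<std::vector<BMM_Color_fl>> line;

    parallel_for(0, height, [&](int y)
    {
        std::vector<BMM_Color_fl>& lineL = line.local(); // this thread's own vector
        lineL.resize(width);                             // no-op after the first row handled by this thread
        bmp->GetLinearPixels(0, y, width, &lineL[0]);
        //Do stuff with the pixels of row y...
    });
} // `line` is destroyed here, freeing every thread-local buffer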