4

I am writing an image processing filter, and I want to speed up the computations using OpenMP. My pseudo-code structure looks like this:

for (every pixel in the image) {
    // do some stuff here
    for (any combination of parameters) {
        // do other stuff here and filter
    }
}

The code filters every pixel with different parameters and chooses the optimal ones.

My question is which is faster: parallelizing the outer (pixel) loop across the processors, or visiting the pixels sequentially and parallelizing the parameter selection.
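The two placements can be sketched like this. This is a minimal sketch: `filter_outer`, `filter_inner`, and the per-pixel arithmetic are made-up stand-ins for the real filter, not the actual algorithm.

```cpp
#include <cassert>
#include <vector>

// Option A: parallelize the outer (pixel) loop; each thread handles a chunk of pixels.
void filter_outer(std::vector<double>& img, int n_params) {
    #pragma omp parallel for
    for (long p = 0; p < (long)img.size(); ++p) {
        double best = 0.0;
        for (int k = 0; k < n_params; ++k)  // sequential parameter search
            best += img[p] * k;             // placeholder for the real filtering
        img[p] = best;
    }
}

// Option B: visit pixels sequentially, parallelize the parameter search.
void filter_inner(std::vector<double>& img, int n_params) {
    for (long p = 0; p < (long)img.size(); ++p) {
        double best = 0.0;
        #pragma omp parallel for reduction(+:best)
        for (int k = 0; k < n_params; ++k)
            best += img[p] * k;
        img[p] = best;
    }
}
```

Both variants compute the same result; only the work distribution differs.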

I think the question could be a more general one: which is faster, giving a large amount of work to each of a few threads, or creating many threads that each do little work?

I don't care about the implementation details for now; I think I can handle them with my previous OpenMP experience. Thanks!

Anthony
    The answer to "which is faster?" is always the same. **Try it both ways and then you'll know**. We can't see the future any better than you can. – Eric Lippert Feb 18 '14 at 19:21
  • Thanks @EricLippert, that makes sense. But my algorithms are quite complex and the images I handle are very big, so I thought that asking my specific question, and then presenting it as a more "general" problem, could save me time and be useful for others – Anthony Feb 25 '14 at 10:42
  • By the way, if you use thread pools with existing threads in them, the thread-creation overhead will not matter that much. And this should be the case if your program is a generally multithreaded one already. – Erik Kaplun Feb 25 '14 at 13:33
  • @Anthony: If your real problem is complex and difficult that is all the more reason to expect that strangers on the internet are going to guess wrong. – Eric Lippert Feb 25 '14 at 14:34

3 Answers

4

what is faster, giving big amounts of operations to every thread, or creating many threads with few operations

Creating a new thread takes significant time and resources, so it's better to create a few threads and give each one a long task.

It also depends on your algorithm: if the threads access the disk or memory too often, they will be suspended frequently, in which case running a few more threads than cores can help keep the CPU busy.
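In OpenMP terms, "few threads with longer tasks" means one parallel region over the whole image rather than one per pixel. A minimal sketch (the `process_pixel` function here is a hypothetical stand-in for the real per-pixel work):

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the real per-pixel filter.
static double process_pixel(double v) { return 2.0 * v + 1.0; }

void filter_image(std::vector<double>& img) {
    // One parallel region for the whole image: the thread team is created
    // (or fetched from OpenMP's internal pool) once, and each thread then
    // works through a long run of pixels, amortizing the startup cost.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)img.size(); ++i)
        img[i] = process_pixel(img[i]);
}
```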

drolando
    Best of all is to create not too many threads, but exactly as many as required, each with the right number of tasks – 4pie0 Feb 18 '14 at 14:02
  • Thread pools exist to allow overcoming the overhead of creating new threads. OpenMP already does that internally for all major implementations. – Everyone Apr 27 '22 at 09:17
4

There tends to be substantial overhead in thread creation and scheduling. In general you want to give each thread enough work that the overhead of creating a new thread is absorbed by the "win" of introducing multithreading.

Additionally, assuming you have sufficiently many pixels, it's a good idea to make sure each thread accesses pixels sequentially. That is better for cache locality, and it keeps the data where you want it to be already; repeatedly loading from main memory will eat into your parallelization win, too.
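With OpenMP, that sequential access pattern falls out of a `schedule(static)` clause, which hands each thread one contiguous slice of the iteration range. A sketch with a dummy per-pixel computation:

```cpp
#include <cassert>
#include <vector>

void scale_image(std::vector<float>& img, float factor) {
    // schedule(static) with no chunk size splits the iterations into one
    // contiguous block per thread, so each thread streams through adjacent
    // pixels -- friendly to caches and hardware prefetchers.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)img.size(); ++i)
        img[i] *= factor;
}
```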

Joan Smith
  • "overhead of creating a new thread" is inaccurate in this context. Parallelizing an inner loop will _*not*_ create new threads on every iteration of the outer one; it will reuse the threads created the first time: https://stackoverflow.com/a/24756350/6871623 How much the inner loop's scheduling overhead costs is another question, though. – Everyone Apr 27 '22 at 09:16
4

Your goal is to distribute the data evenly over the available processors. You should split the image up (the outer loop) evenly, with one thread per processor core. Experiment with fine- and coarse-grained parallelism to see what gives the best results. Once your number of threads exceeds the number of available cores, you will start to see performance degradation.
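One low-effort way to run that experiment is `schedule(runtime)`, which lets you switch granularity via the `OMP_SCHEDULE` environment variable without recompiling. A sketch (the squaring is a placeholder for the real per-pixel work):

```cpp
#include <cassert>
#include <vector>

void filter_runtime(std::vector<double>& img) {
    // Try e.g. OMP_SCHEDULE="static" (coarse: one contiguous block per
    // thread), "static,64" or "dynamic,64" (finer grain), and time each run.
    #pragma omp parallel for schedule(runtime)
    for (long i = 0; i < (long)img.size(); ++i)
        img[i] = img[i] * img[i];  // placeholder per-pixel work
}
```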

Eamonn McEvoy
    This can help cache locality too - e.g. 4 CPUs each needing only 25% of the pixels in its cache vs. 4 CPUs each needing 100% of the pixels in its cache – Brendan Feb 18 '14 at 14:20