
I am new to OpenMP and using it to parallelize a for-loop (to be accurate, I am using prange in Cython).

However, the work per iteration is very uneven and, as a result, quite a few threads sit idle until the longest-running block of the for-loop completes.

I wanted to know whether there is a way to access the idle threads so that I can use them to parallelize the bottleneck operations.


1 Answer


This question boils down to the problem of perfectly scheduling tasks, which is quite hard in the general case, so usually one falls back to heuristics.

OpenMP offers different heuristics for scheduling, which can be chosen via the schedule argument of prange (documentation).

Let's look at the following example:

%%cython -c=/openmp --link-args=/openmp
cdef double calc(int n) nogil:
    cdef double d=0.0
    cdef int i
    for i in range(n):
        d+=0.1*i*n
    return d

def single_sum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in range(n):
        sum += calc(i)
    return sum

The evaluation of calc(n) takes O(n) time, because an IEEE 754-compliant compiler is not allowed to replace the for-loop by a closed formula.

Now let's replace range with prange:

...
from cython.parallel import prange   
def default_psum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2):
        sum += calc(i)
    return sum

I have limited the number of threads to 2 to make the effect more dramatic. Comparing the running times, we see:

N=4*10**4
%timeit single_sum(N) #no parallelization
# 991 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit default_psum(N) #parallelization
# 751 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

not as much improvement as we would like (the ideal speed-up being 2)!

Which schedule is chosen when none is set explicitly is an implementation detail of the OpenMP provider - but most probably it will be "static" without a defined chunksize. In this case the range is split in half: one thread gets the first, fast half, while the other gets the second half, where almost all of the work must be done - so a big part of the work isn't parallelized in the end.
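A back-of-the-envelope model (plain Python, not actual OpenMP) makes the imbalance concrete: iteration i of the loop costs O(i), since calc(i) performs i additions, so we can just sum the iteration indices each thread receives.

```python
# Model: cost of iteration i is proportional to i. Under a "static"
# schedule without chunksize, each of the 2 threads gets one contiguous
# half of the iteration range.
n = 4 * 10**4
half = n // 2
work_first_half = sum(range(0, half))      # thread 0: the cheap half
work_second_half = sum(range(half, n))     # thread 1: the expensive half
total = work_first_half + work_second_half

print(work_second_half / work_first_half)  # ~3.0: thread 1 does 3x the work
print(total / work_second_half)            # ~1.33: the implied speed-up
```

The implied speed-up of 4/3 ≈ 1.33 matches the measured 991 ms / 751 ms ≈ 1.32 quite well.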

A better strategy for balancing the load is to give i=0 to the first thread, i=1 to the second, i=2 again to the first, and so on. For the "static" schedule this can be achieved by setting chunksize to 1:

def static_psum1(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2, schedule="static", chunksize=1):
        sum += calc(i)
    return sum

With that, we almost reach the maximal possible speed-up of 2:

%timeit static_psum1(N)
# 511 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Choosing the best schedule is a trade-off between scheduling overhead (not very high in the example above) and work balance - and the best trade-off can be found only after analyzing the problem (and the hardware!) at hand.
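The balance side of this trade-off can be modeled with a toy scheduler (a sketch under the assumptions that iteration i costs exactly i and that scheduling itself is free - so it shows only the achievable balance, not the overhead):

```python
import heapq

def static_speedup(n, num_threads, chunksize):
    # "static": chunks are dealt out round-robin to the threads, up front
    loads = [0] * num_threads
    for k, start in enumerate(range(0, n, chunksize)):
        # cost of iteration i is modeled as i, so a chunk costs sum(chunk)
        loads[k % num_threads] += sum(range(start, min(start + chunksize, n)))
    return sum(loads) / max(loads)   # speed-up = total work / busiest thread

def dynamic_speedup(n, num_threads, chunksize):
    # "dynamic": whichever thread finishes first grabs the next chunk
    finish = [0] * num_threads       # a valid min-heap of finish times
    for start in range(0, n, chunksize):
        work = sum(range(start, min(start + chunksize, n)))
        heapq.heappush(finish, heapq.heappop(finish) + work)
    total = n * (n - 1) // 2
    return total / max(finish)

n = 4 * 10**4
print(round(static_speedup(n, 2, n // 2), 2))   # contiguous halves -> ~1.33
print(round(static_speedup(n, 2, 1), 2))        # "static", chunksize=1 -> ~2.0
print(round(dynamic_speedup(n, 2, 1), 2))       # "dynamic", chunksize=1 -> ~2.0
```

This reproduces the pattern in the timing table below: the default contiguous split is stuck near 1.33, while fine-grained static and dynamic schedules both approach the ideal balance, and differ in practice only by their (here unmodeled) overhead.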

Here are some timings for the above example for different scheduling strategies and different number of threads:

(schedule, chunksize)      2 threads            8 threads
single-threaded            991 ms               991 ms
(default)                  751 ms               265 ms
static                     757 ms               274 ms
static,1                   511 ms               197 ms
static,10                  512 ms               166 ms
dynamic,1                  509 ms               158 ms
dynamic,10                 509 ms               156 ms
guided                     508 ms               158 ms

Trying different schedules only makes sense when there is at least a theoretical possibility of achieving a good balance.

If there is a single task which takes 90% of the running time, then no matter which scheduling strategy is used, it will not be possible to improve the performance much. In that case the big task itself should be parallelized; sadly, Cython's support for OpenMP is somewhat lacking (see for example this SO-post), so it is possibly better to code in pure C and then wrap the resulting functionality with Cython.
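The decomposition such a parallelization would perform can be sketched in plain Python (the names calc_range and parallel_calc are illustrative, not from any library; Python threads are GIL-bound, so this only demonstrates the split - inside a C function the same split would be a single OpenMP parallel-for reduction):

```python
from concurrent.futures import ThreadPoolExecutor

def calc_range(start, stop, n):
    # partial sum of calc's loop body over the sub-range [start, stop)
    return sum(0.1 * i * n for i in range(start, stop))

def parallel_calc(n, num_threads=4):
    # split the dominant task's inner loop into num_threads sub-ranges
    step = -(-n // num_threads)                       # ceil(n / num_threads)
    bounds = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(num_threads) as pool:
        parts = pool.map(lambda b: calc_range(b[0], b[1], n), bounds)
    return sum(parts)                                 # combine partial sums
```

The partial sums are associative up to floating-point rounding, so the result matches the sequential loop to within rounding error; in C, `#pragma omp parallel for reduction(+:d)` expresses the same idea.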

  • Thank you very much for your detailed response. I have played a bit with the scheduling, but due to the nature of the algorithm, I will have idle threads. I was hoping to utilize them somehow. – agmnu Oct 09 '19 at 15:50
  • For my particular problem, nested parallelism seems to be a way to reduce the time taken by the bottleneck thread. – agmnu Oct 09 '19 at 15:52
  • @agmnu I hope you consider to self-answer your question - would be interesting to see the details. – ead Oct 09 '19 at 16:00
  • Rather than using nested parallelism, consider using an OpenMP taskloop and creating additional tasks where appropriate. Nested parallelism does *not* re-use idle threads, but rather creates additional ones and often leads to over-subscription. (Of course, the Python interface to OpenMP may not exist for taskloop or task creation :-( ). – Jim Cownie Oct 10 '19 at 08:35
  • @Jim Yes! I came across 'tasks' in the openMP book yesterday. I have been considering them since then and your comment gives me more confidence in this direction. Thanks! – agmnu Oct 10 '19 at 17:36
  • @Jim Python may not be able to interface with OpenMP, but Cython can wrap around C-functions :D So, the tentative plan is to make a C-function with OpenMP constructs for that computation module and the wrap it with Cython. – agmnu Oct 10 '19 at 17:37