
I am running the following loop using, say, 8 OpenMP threads:

float* data;
int n;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data, n)
for ( int i = 0; i < n; ++i )
{
    DO SOMETHING WITH data[i]
}

Due to NUMA, I'd like to run the first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3 and the second half (i = n/2, ..., n-1) with threads 4,5,6,7.

Essentially, I want to run two loops in parallel, each loop using a separate group of OpenMP threads.

How do I achieve this with OpenMP?

Thank you

PS: Ideally, if threads from one group are done with their half of the loop while the other half is still not done, I'd like the threads from the finished group to join the unfinished group and help process the other half of the loop.

I am thinking about something like the code below, but I wonder whether I can do this with OpenMP and no extra book-keeping:

int n;
int i0 = 0;
int i1 = n / 2;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data,n,i0,i1)
for ( int i = 0; i < n; ++i )
{
    int nt = omp_get_thread_num();
    int j;
    #pragma omp critical
    {
        if ( nt < 4 ) {
            if ( i0 < n / 2 ) j = i0++; // First 4 threads process first half
            else              j = i1++; // of loop unless first half is finished
        }
        else {
            if ( i1 < n ) j = i1++;  // Second 4 threads process second half
            else          j = i0++;  // of loop unless second half is finished 
        }
    }

    DO SOMETHING WITH data[j]
}
  • Can you explain why you say "Due to NUMA, I'd like to run the first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3 and the second half (i = n/2, ..., n-1) with threads 4,5,6,7."? – Z boson Jul 25 '14 at 14:31
  • Because `data` is allocated in such a way that the first half of it is close to one socket (where I run threads 0,1,2,3) and the second half of it is close to another socket (where I run threads 4,5,6,7) – user2052436 Jul 25 '14 at 14:33
  • What is your OS and hardware and compiler? Linux? Two sockets Intel Xeon? Gcc? – Z boson Jul 25 '14 at 14:34
  • @Zboson RHEL 6.3, 8-socket Xeon CPU E5-4640 (64 cores total). 1 TB memory. Example in the post is simplified. I need more than 2 groups of threads. Compiler: GCC 4.8.3 or latest Intel. – user2052436 Jul 25 '14 at 14:38
  • Are you sure you want `schedule(dynamic,1)` or do you want `schedule(static)`? – Z boson Jul 25 '14 at 14:43
  • dynamic, because the execution time for different `i` can be different. And the block size is `1` because of cache misses (I don't want to go into details of the DO SOMETHING part of the code) – user2052436 Jul 25 '14 at 14:50
  • You can use numactl on the command line or sched_setaffinity to pin threads to particular cores, but the nicer approach is to make sure your data is in the right place for the threads by initializing the data with OMP threads, to make sure the data and the threads are collocated. – Jonathan Dursi Jul 25 '14 at 15:00
  • Threads are pinned, data is allocated close to threads. That's not an issue. The issue is loop splitting code. – user2052436 Jul 25 '14 at 15:02
  • Ahh - finally understand. Interesting. – Jonathan Dursi Jul 25 '14 at 15:07
  • In my experience on Linux the threads are scattered. So thread 0 would go to socket 0, thread 1 to socket 1, and so forth (Windows uses compact). The way you describe appears to be the compact form. Are you sure that's the correct topology? – Z boson Jul 25 '14 at 15:09
  • I set KMP_AFFINITY appropriately. So threads run where I tell them to run. – user2052436 Jul 25 '14 at 15:11

1 Answer

Probably best is to use nested parallelization, first over NUMA nodes, then within each node; then you can use the infrastructure for dynamic scheduling while still breaking the data up amongst thread groups:

#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {

    const int ngroups=2;
    const int npergroup=4;
    const int ndata = 16;

    omp_set_nested(1);   /* enable nested parallel regions */

    /* Outer loop: one thread per group (e.g. per NUMA node) */
    #pragma omp parallel for num_threads(ngroups)
    for (int i=0; i<ngroups; i++) {
        /* contiguous, nearly equal block of indices for this group */
        int start = (ndata*i+(ngroups-1))/ngroups;
        int end   = (ndata*(i+1)+(ngroups-1))/ngroups;

        /* Inner loop: dynamic scheduling among this group's threads */
        #pragma omp parallel for num_threads(npergroup) shared(i, start, end) schedule(dynamic,1)
        for (int j=start; j<end; j++) {
            printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
        }
    }

    return 0;
}

Running this gives

$ gcc -fopenmp -o nested nested.c -Wall -O -std=c99
$ ./nested | sort -n -k 9
Thread 0 from group 0 working on data 0
Thread 3 from group 0 working on data 1
Thread 1 from group 0 working on data 2
Thread 2 from group 0 working on data 3
Thread 1 from group 0 working on data 4
Thread 3 from group 0 working on data 5
Thread 3 from group 0 working on data 6
Thread 0 from group 0 working on data 7
Thread 0 from group 1 working on data 8
Thread 3 from group 1 working on data 9
Thread 2 from group 1 working on data 10
Thread 1 from group 1 working on data 11
Thread 0 from group 1 working on data 12
Thread 0 from group 1 working on data 13
Thread 2 from group 1 working on data 14
Thread 0 from group 1 working on data 15

But note that the nested approach may well change the thread assignments over what the one-level threading would be, so you will probably have to play with KMP_AFFINITY or other mechanisms a bit more to get the bindings right again.
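
For example (a sketch only, assuming an OpenMP 4.0 runtime that understands OMP_PLACES and per-nesting-level OMP_PROC_BIND, e.g. recent Intel or GCC >= 4.9), you could spread the outer team across sockets and keep each inner team close to its master thread:

$ export OMP_NESTED=TRUE
$ export OMP_PLACES=cores
$ export OMP_PROC_BIND=spread,close
$ ./nested

Here the first OMP_PROC_BIND value applies to the outer (per-group) level and the second to the inner (within-group) level; with the Intel runtime, KMP_AFFINITY is the usual alternative, as noted above.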

  • That's a clever answer. I have not used `omp_set_nested` yet. – Z boson Jul 25 '14 at 15:20
  • Thanks - once I finally understood the question, it mapped nicely onto this. – Jonathan Dursi Jul 25 '14 at 15:42
  • Thanks. I guess you could also use tasks in the outer loop. Don't know if it makes a difference. I am also trying to understand the omp teams construct (never used them before). Can this feature be used instead of nested parallelism? – user2052436 Jul 25 '14 at 15:46
  • Tasks or parallel for at the top level, it doesn't really matter - whatever makes it easier to read or write. The nice thing about the loop is that it generalizes easily to a different number of top-level NUMA nodes. Teams do refer to nested parallelism, although be careful - in OMP 4, teams refers to the accelerator (GPU/Phi) stuff. – Jonathan Dursi Jul 25 '14 at 15:52 (see the task-based sketch after these comments)
  • @JonathanDursi I just implemented your approach in my code and tested it on a 2-socket (16-core total) machine. With the old code, execution time dropped from 55 seconds to 50 when going from 8 threads to 16. With the NUMA-aware code, the 16-thread test runs in 32 seconds! – user2052436 Jul 25 '14 at 20:10
  • The thread topology for the OP's question is compact. How could this be done if the thread topology is scattered (which is the default for Linux)? For example with eight threads the even threads (0,2,4,6) would map to one socket and the odd threads (1,3,5,7) to another socket. It seems that the scattered topology is not very useful for this? – Z boson Aug 19 '14 at 12:15
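
For reference, here is a minimal sketch of the task-based top level mentioned in the comments (untested, and not from the answer above): it reuses the answer's ngroups/npergroup/ndata constants and block partition, and still relies on omp_set_nested(1) for the inner region.

#include <omp.h>
#include <stdio.h>

int main(void) {

    const int ngroups=2;
    const int npergroup=4;
    const int ndata = 16;

    omp_set_nested(1);   /* the nested inner regions still need this */

    #pragma omp parallel num_threads(ngroups)
    #pragma omp single
    for (int i=0; i<ngroups; i++) {
        #pragma omp task firstprivate(i)
        {
            /* same contiguous block partition as in the answer */
            int start = (ndata*i+(ngroups-1))/ngroups;
            int end   = (ndata*(i+1)+(ngroups-1))/ngroups;

            /* each task opens its own inner team for its block */
            #pragma omp parallel for num_threads(npergroup) schedule(dynamic,1)
            for (int j=start; j<end; j++) {
                printf("Thread %d from group %d working on data %d\n",
                       omp_get_thread_num(), i, j);
            }
        }
    }

    return 0;
}

As the comments suggest, whether this performs any differently from the nested parallel for is not established; it simply trades the outer parallel for for explicit task creation.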