
I'm new to OpenMP, but am trying to use it to accelerate some operations on entries of a 2D array with a large number of rows and a small number of columns. At the same time, I am using a reduction to calculate the sum of all the array values in each column. The code looks something like this (I will explain the weird bits in a moment):

template <unsigned int NColumns>
void Function(int n_rows, double** X, double* Y)
{
    #pragma omp parallel for reduction(+:Y[:NColumns])
    for (int r = 0; r < n_rows; ++r)
    {
        for (int c = 0; c < NColumns; ++c)
        {
            X[r][c] = some_complicated_stuff(X[r], X[r][c]);
            Y[c] += X[r][c];
        }
    }
}

To clarify, X is an n_rows x NColumns 2D array allocated on the heap, and Y is an NColumns-sized 1D array. some_complicated_stuff isn't actually implemented as a separate function, but what I do to X[r][c] in that line depends only on X[r][c] and the other values in the 1D array X[r].

The reason that NColumns is passed in as a template parameter rather than as a regular argument (like n_rows) is that when NColumns is known at compile-time, the compiler can more aggressively optimise the inner loop in the above function. I know that NColumns is going to be one of a small number of values when the program runs, so later on I have something like this code:

int n_cols, n_rows;
cin >> n_cols;
double** X;
double Y[n_cols]; // variable-length array: a GCC extension, not standard C++

// initialise n_rows, X and Y, etc. . .

for (int i = 0; i < n_iterations; ++i)
{
    switch (n_cols)
    {
        case  2: Function< 2>(n_rows, X, Y); break;
        case 10: Function<10>(n_rows, X, Y); break;
        case 60: Function<60>(n_rows, X, Y); break;
        default: throw "Unsupported n_cols."; break;
    }
    // . . .
    Report(i, Y); // see GDB output below
}

Through testing, I have found that passing NColumns to Function as a template parameter rather than as a normal function parameter makes for an appreciable performance increase. However, I have also found that, once in a blue moon (say, about every 10^7 calls to Function), the program hangs; even worse, its behaviour sometimes changes from one run of the program to the next. This happens rarely enough that I have been having a lot of trouble isolating the bug, but I'm now wondering whether it's because I am using this NColumns template parameter in my OpenMP reduction.

I note that a similar StackOverflow question asks about using template types in reductions, which apparently causes unspecified behaviour; the OpenMP 3.0 spec says:

If a variable referenced in a data-sharing attribute clause has a type derived from a template, and there are no other references to that variable in the program, then any behavior related to that variable is unspecified.

In this case, it's not a template type per se that is being used, but I'm sort of in the same ballpark. Have I messed up here, or is the bug more likely to be in some other part of the code?

I am using GCC 6.3.0.
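
In case it is the template parameter appearing in the clause that is the problem, one workaround I have considered (an untested sketch; n_cols_local is just an illustrative name, and I don't know whether this is actually any different as far as the spec is concerned) is to copy NColumns into an ordinary local variable and name that in the reduction clause, while keeping NColumns in the loop bounds so the compiler can still optimise the inner loop:

template <unsigned int NColumns>
void Function(int n_rows, double** X, double* Y)
{
    // Plain runtime variable, so the reduction clause no longer refers to
    // anything derived from the template parameter.
    const int n_cols_local = NColumns;

    #pragma omp parallel for reduction(+:Y[:n_cols_local])
    for (int r = 0; r < n_rows; ++r)
    {
        for (int c = 0; c < NColumns; ++c)
        {
            X[r][c] = some_complicated_stuff(X[r], X[r][c]);
            Y[c] += X[r][c];
        }
    }
}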

If it is more helpful, here's the real code from inside Function. X is actually a flattened 2D array; ww and min_x are defined elsewhere:

#pragma omp parallel for reduction(+:Y[:NColumns])
for (int i = 0; i < NColumns * n_rows; i += NColumns)
{
    double total = 0;
    for (int c = 0; c < NColumns; ++c)
        if (X[i + c] > 0)
            total += X[i + c] *= ww[c];

    if (total > 0)
        for (int c = 0; c < NColumns; ++c)
            if (X[i + c] > 0)
                Y[c] += X[i + c] = (X[i + c] < min_x * total ? 0 : X[i + c] / total);
}
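
For completeness, the alternative I have been considering if the array-section reduction itself turns out to be the culprit is to drop the reduction clause and combine the partial sums by hand: each thread accumulates into its own local array, and the partials are added into Y inside a critical section at the end. This is only an untested sketch of the idea (Y_local is just an illustrative name), not something I have benchmarked:

#pragma omp parallel
{
    double Y_local[NColumns] = {}; // per-thread partial sums, zero-initialised

    #pragma omp for nowait
    for (int i = 0; i < NColumns * n_rows; i += NColumns)
    {
        double total = 0;
        for (int c = 0; c < NColumns; ++c)
            if (X[i + c] > 0)
                total += X[i + c] *= ww[c];

        if (total > 0)
            for (int c = 0; c < NColumns; ++c)
                if (X[i + c] > 0)
                    Y_local[c] += X[i + c] = (X[i + c] < min_x * total ? 0 : X[i + c] / total);
    }

    // Combine the per-thread partials; only these NColumns additions are serialised.
    #pragma omp critical
    {
        for (int c = 0; c < NColumns; ++c)
            Y[c] += Y_local[c];
    }
}

The combination order here is still whichever order the threads happen to reach the critical section, so it would not make the floating-point result deterministic, but it does take the array section out of the reduction clause entirely.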

Just to thicken the plot a bit, I attached GDB to a running process of the program that had hung, and here's what the backtrace shows me:

#0  0x00007fff8f62a136 in __psynch_cvwait () from /usr/lib/system/libsystem_kernel.dylib
#1  0x00007fff8e65b560 in _pthread_cond_wait () from /usr/lib/system/libsystem_pthread.dylib
#2  0x000000010a4caafb in omp_get_num_procs () from /opt/local/lib/libgcc/libgomp.1.dylib
#3  0x000000010a4cad05 in omp_get_num_procs () from /opt/local/lib/libgcc/libgomp.1.dylib
#4  0x000000010a4ca2a7 in omp_in_final () from /opt/local/lib/libgcc/libgomp.1.dylib
#5  0x000000010a31b4e9 in Report(int, double*) ()
#6  0x3030303030323100 in ?? ()
[snipped traces 7-129, which are all ?? ()]
#130 0x0000000000000000 in ?? ()

Report() is a function that gets called inside the program's main loop but not within Function() (I've added it to the middle code snippet above), and Report() does not contain any OpenMP pragmas. Does this illuminate what's happening at all?

Note that the executable changed between when the process started running and when I attached GDB to it, so GDB was reading symbols from the new (changed) executable. That could mean the symbol table is messed up.

n.g.davies
  • Can you please edit your question to specify the compiler you are using and its version? – Sergio Monteleone Apr 24 '18 at 10:46
  • I would strongly prefer to use the old scalar reduction by swapping the for loops, unless there is a good reason (not visible here) for the array reduction. Besides your code pushing the limits of what is likely to be tested adequately, it might consume too much stack. In my tests of much simpler cases, array reduction scaled adequately only to 2 threads, even where it was quite efficient at 1 thread. – tim18 Apr 24 '18 at 11:11
  • @tim18 n_rows is around 10^6 and some_complicated_stuff() makes calculations that are reused for other entries in the same row, but not other entries in the same column, so the existing order of the for loops works best I think. – n.g.davies Apr 24 '18 at 15:12

1 Answer


I have managed to partly work this out.

One of the problems was the program behaving nondeterministically. This is simply because (1) OpenMP combines the partial results of a reduction in an unspecified order (in practice, typically the order in which threads finish, which varies from run to run), and (2) floating-point addition is not associative. I had assumed that the partial results would be combined in thread-number order, but this is not the case. So any OpenMP for construct that reduces using floating-point operations is potentially non-deterministic even if the number of threads is the same from one run to the next, so long as the number of threads is greater than 2. Some relevant StackOverflow questions on this matter are here and here.
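
To illustrate point (2), here is a minimal example (nothing to do with OpenMP itself) showing that the sum of the same three doubles depends on how the additions are grouped:

#include <cstdio>

int main()
{
    double a = 1e16, b = -1e16, c = 1.0;

    // Same three values, two different groupings of the additions:
    std::printf("%.1f\n", (a + b) + c); // prints 1.0
    std::printf("%.1f\n", a + (b + c)); // prints 0.0, because b + c rounds back to -1e16
    return 0;
}

So even with a fixed number of threads, any change in the order in which the per-thread partial sums are combined can change the final values in Y. If bit-for-bit reproducibility were needed, the partials would have to be combined in a fixed order (for example serially, in thread-number order) rather than via the reduction clause.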

The other problem was that the program occasionally hangs. I have not been able to resolve this issue. Running GDB on the hung process always shows __psynch_cvwait () at the top of the backtrace. The hang occurs roughly once every 10^8 executions of the parallelised for loop.

Hope this helps a little.

n.g.davies