I'm new to OpenMP, but am trying to use it to accelerate some operations on entries of a 2D array with a large number of rows and a small number of columns. At the same time, I am using a reduction to calculate the sum of all the array values in each column. The code looks something like this (I will explain the weird bits in a moment):
template <unsigned int NColumns>
void Function(int n_rows, double** X, double* Y)
{
#pragma omp parallel for reduction(+:Y[:NColumns])
for (int r = 0; r < n_rows; ++r)
{
for (int c = 0; c < NColumns; ++c)
{
X[r][c] = some_complicated_stuff(X[r], X[r][c]);
Y[c] += X[r][c];
}
}
}
To clarify, X is a n_rows
x NColumns
-sized 2D array allocated on the heap, and Y is a NColumns
-sized 1D array. some_complicated_stuff
isn't actually implemented as a separate function, but what I do to X[r][c]
in that line only depends on X[r][c]
and other values in the 1D array X[r]
.
The reason that NColumns
is passed in as a template parameter rather than as a regular argument (like n_rows
) is that when NColumns
is known at compile-time, the compiler can more aggressively optimise the inner loop in the above function. I know that NColumns
is going to be one of a small number of values when the program runs, so later on I have something like this code:
cin >> n_cols;
double** X;
double Y[n_cols];
// initialise X and Y, etc. . .
for (int i = 0; i < n_iterations; ++i)
{
switch (n_cols)
{
case 2: Function< 2>(X, Y); break;
case 10: Function<10>(X, Y); break;
case 60: Function<60>(X, Y); break;
default: throw "Unsupported n_cols."; break;
}
// . . .
Report(i, Y); // see GDB output below
}
Through testing, I have found that having this NColumns
"argument" to Update as a template parameter rather than a normal function parameter actually makes for an appreciable performance increase. However, I have also found that, once in a blue moon (say, about every 10^7 calls to Function
), the program hangs—and even worse, that its behaviour sometimes changes from one run of the program to the next. This happens rarely enough that I have been having a lot of trouble isolating the bug, but I'm now wondering whether it's because I am using this NColumns
template parameter in my OpenMP reduction.
I note that a similar StackOverflow question asks about using template types in reductions, which apparently causes unspecified behaviour - the OpenMP 3.0 spec says
If a variable referenced in a data-sharing attribute clause has a type derived from a template, and there are no other references to that variable in the program, then any behavior related to that variable is unspecified.
In this case, it's not a template type per se that is being used, but I'm sort of in the same ballpark. Have I messed up here, or is the bug more likely to be in some other part of the code?
I am using GCC 6.3.0.
If it is more helpful, here's the real code from inside Function
. X
is actually a flattened 2D array; ww
and min_x
are defined elsewhere:
#pragma omp parallel for reduction(+:Y[:NColumns])
for (int i = 0; i < NColumns * n_rows; i += NColumns)
{
double total = 0;
for (int c = 0; c < NColumns; ++c)
if (X[i + c] > 0)
total += X[i + c] *= ww[c];
if (total > 0)
for (int c = 0; c < NColumns; ++c)
if (X[i + c] > 0)
Y[c] += X[i + c] = (X[i + c] < min_x * total ? 0 : X[i + c] / total);
}
Just to thicken the plot a bit, I attached gdb to a running process of the program which hanged, and here's what the backtrace shows me:
#0 0x00007fff8f62a136 in __psynch_cvwait () from /usr/lib/system/libsystem_kernel.dylib
#1 0x00007fff8e65b560 in _pthread_cond_wait () from /usr/lib/system/libsystem_pthread.dylib
#2 0x000000010a4caafb in omp_get_num_procs () from /opt/local/lib/libgcc/libgomp.1.dylib
#3 0x000000010a4cad05 in omp_get_num_procs () from /opt/local/lib/libgcc/libgomp.1.dylib
#4 0x000000010a4ca2a7 in omp_in_final () from /opt/local/lib/libgcc/libgomp.1.dylib
#5 0x000000010a31b4e9 in Report(int, double*) ()
#6 0x3030303030323100 in ?? ()
[snipped traces 7-129, which are all ?? ()]
#130 0x0000000000000000 in ?? ()
Report()
is a function that gets called inside the program's main loop but not within Function()
(I've added it to the middle code snippet above), and Report()
does not contain any OpenMP pragmas. Does this illuminate what's happening at all?
Note that the executable changed between when the process started running and when I attached GDB to it, which required referring to the new (changed) executable. So that could mean that the symbol table is messed up.