OpenACC program built with a C++ compiler is way slower than the C-built version

Question

The code I'm working on is in C++ and is slightly complicated, but the example below shows the problem. It comes from a book by Chandrasekaran and Juckeland. If it is compiled with nvc -acc (or pgcc -acc, as the authors did) and run, it takes a few seconds to finish. If I use nvc++ -acc (pgc++ -acc), it is orders of magnitude slower, even worse than the serial version. I'm curious whether anyone has noticed a similar issue or knows a possible explanation.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>
#define WIDTH 1000
#define HEIGHT 1000
#define TEMP_TOLERANCE 0.01
double Temperature[HEIGHT+2][WIDTH+2];
double Temperature_previous[HEIGHT+2][WIDTH+2];
void initialize();
void track_progress(int iter);

int main(int argc, char *argv[]) {
    int i, j;
    int iteration=1;
    double worst_dt=100;
    struct timeval start_time, stop_time, elapsed_time;
    gettimeofday(&start_time,NULL);
    initialize();

#pragma acc data copy(Temperature_previous), create(Temperature)    
{
    while ( worst_dt > TEMP_TOLERANCE ) {

#pragma acc kernels
        for(i = 1; i <= HEIGHT; i++) {
            for(j = 1; j <= WIDTH; j++) {
                Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                        + Temperature_previous[i-1][j]
                        + Temperature_previous[i][j+1]
                        + Temperature_previous[i][j-1]);
            }
        }
        worst_dt = 0.0;

#pragma acc kernels 
        for(i = 1; i <= HEIGHT; i++){
            for(j = 1; j <= WIDTH; j++){
                worst_dt = fmax( fabs(Temperature[i][j]-
                            Temperature_previous[i][j]),worst_dt);
                Temperature_previous[i][j] = Temperature[i][j];
            }
        }

        if((iteration % 100) == 0) {
#pragma acc update host(Temperature)
            track_progress(iteration);
        }
        iteration++;
    }
}

    gettimeofday(&stop_time,NULL);
    timersub(&stop_time, &start_time, &elapsed_time);
    printf("\nMax error at iteration %d was %f\n",
            iteration-1, worst_dt);
    printf("Total time was %f seconds.\n",
            elapsed_time.tv_sec+elapsed_time.tv_usec/1000000.0);
}
void initialize(){
    int i,j;
    for(i = 0; i <= HEIGHT+1; i++){
        for (j = 0; j <= WIDTH+1; j++){
            Temperature_previous[i][j] = 0.0;
        }
    }
    for(i = 0; i <= HEIGHT+1; i++) {
        Temperature_previous[i][0] = 0.0;
        Temperature_previous[i][WIDTH+1] = (100.0/HEIGHT)*i;
    }
    for(j = 0; j <= WIDTH+1; j++) {
        Temperature_previous[0][j] = 0.0;
        Temperature_previous[HEIGHT+1][j] = (100.0/WIDTH)*j;
    }
}
void track_progress(int iteration) {
    int i;
    printf("---------- Iteration number: %d ------------\n",
            iteration);
    for(i = HEIGHT-5; i <= HEIGHT; i++) {
        printf("[%d,%d]: %5.2f ", i, i, Temperature[i][i]);
    }
    printf("\n");
}

I tested it on two Linux machines with the same result. I tried various compilers, but as long as I was compiling the code as C++ rather than C, the problem remained. -Minfo=all doesn't show anything of qualitative significance.

jefflarkin · Accepted Answer · 2023-08-28T19:43:02.683

I'm seeing the same thing. If you add -Minfo=accel to the command line, you'll see that the compiler stops parallelizing the loop at line 37 (the loop nest under the second #pragma acc kernels), saying that the value of worst_dt is needed later on, which causes a dependency. I don't know why the compiler analysis isn't working correctly here, but if you change line 36 (the second #pragma acc kernels) to the following, you'll get the performance back: #pragma acc kernels loop independent collapse(2) reduction(max:worst_dt).
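For reference, here is a minimal sketch of the second loop nest with that directive applied. The wrapper function update_and_max_change is hypothetical (the original program has this loop inline in main, inside the acc data region); it is only there so the loop can be compiled and checked on its own. Without -acc the pragma is simply ignored and the loop runs serially on the host.

```cpp
#include <math.h>

#define WIDTH 1000
#define HEIGHT 1000

double Temperature[HEIGHT+2][WIDTH+2];
double Temperature_previous[HEIGHT+2][WIDTH+2];

// Hypothetical helper, not part of the original program: runs the second
// loop nest once and returns the largest change between the two grids.
double update_and_max_change(void) {
    double worst_dt = 0.0;
    int i, j;

    // The explicit reduction(max:worst_dt) clause tells the compiler how
    // worst_dt is combined across iterations, so it no longer has to
    // recognize the fmax pattern on its own.
#pragma acc kernels loop independent collapse(2) reduction(max:worst_dt)
    for (i = 1; i <= HEIGHT; i++) {
        for (j = 1; j <= WIDTH; j++) {
            worst_dt = fmax(fabs(Temperature[i][j] -
                                 Temperature_previous[i][j]), worst_dt);
            Temperature_previous[i][j] = Temperature[i][j];
        }
    }
    return worst_dt;
}
```

In the original code, only the directive on line 36 changes; the loop body stays exactly as posted.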

Update: You can also try adding -std=c++11 or -std=c++14 to your compilation and get the expected performance without modifying the code. I am not fully sure why the default standard has this issue, but the compiler team has been made aware of this exchange.

score 2 · Answer 2 · answered Aug 28 '23 at 19:53

The problem has to do with how "fmax" is presented to the compiler. To stay object-compatible, nvc++ uses the g++ STL and system header files. In newer g++ installs, where C++17 is the default language level, std::fmax gets used, so the call is presented as a function call rather than as the builtin C99 intrinsic. nvc++ needs the C99 version in order to auto-detect the max and implicitly perform the reduction. Otherwise, as the compiler feedback messages show, there's a dependency that forces the code to run serially on the device.

You can work around this by setting "--std=c++11" so the C99 fmax is used, but adding the explicit reduction as Jeff suggests is probably the better and more portable way to go.
