
I followed the warp-divergence examples (the code below) from the textbook "Professional CUDA C Programming".

// Condition depends on tid % 2, so even and odd threads within the same warp diverge
__global__ void math_kernel1(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    a = b = 0.f;
    if (tid % 2 == 0) {
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}

// (tid / warpSize) is uniform within a warp, so branching happens at warp granularity
__global__ void math_kernel2(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    a = b = 0.f;
    if ((tid / warpSize) % 2 == 0) {
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}

// Same even/odd condition as math_kernel1, but written as two independent if statements
__global__ void math_kernel3(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    bool ipred = (tid % 2 == 0);
    a = b = 0.f;
    if (ipred) {
        a = 100.0f;
    }
    if (!ipred) {
        b = 200.0f;
    }

    c[tid] = a + b;
}

Obviously (and as stated in the textbook), math_kernel2 should have the best branch efficiency, math_kernel1 should come next, and math_kernel3 should have the worst result. However, the nvprof report gives me results that contradict the textbook. I benchmarked these kernels on a GTX 1080 Ti with CUDA 8.0 (I also added the compiler flags -g -G to nvcc to disable optimizations), and it reports the following branch efficiency:

  1. math_kernel1 83.33%
  2. math_kernel2 100.00%
  3. math_kernel3 100.00% (expected to be lower than math_kernel1; the textbook reports 71.43%)
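For reference, a sketch of the compile and profiling commands I used (the file name and the -arch flag are my own choices, not from the book):

```shell
# Build with device debug info; -G disables device-side optimizations
nvcc -g -G -arch=sm_61 -o divergence divergence.cu

# Report the branch efficiency metric per kernel
nvprof --metrics branch_efficiency ./divergence
```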
  • The compiler is likely smarter than the authors of the book – talonmies Mar 22 '19 at 14:56
  • It appears that the only real question is about the placement of kernel3. kernel3 suffers no loss of branch efficiency because the compiler realizes the conditional code using predication rather than branching. Since there is no branching, there is no loss of efficiency due to branching for the conditional portion of the code. If the question is really "why does the book say what it does" then I wouldn't be able to answer that. Perhaps you are misreading the book, or your test case doesn't match what is in the book (e.g. compiler version, CUDA version, compile command line, etc.) – Robert Crovella Mar 23 '19 at 16:26
  • You have failed to ask a question here. – Robert Crovella Mar 23 '19 at 16:28
  • @RobertCrovella Totally understood the part about the compiler having optimized this code, and sorry for my vague description. I am not intending to reproduce the results in the book, but since we have differences I am curious to figure out what happens here. And I still do not understand the difference between "prediction" and "branch". Does "branch" mean we have an else clause while "prediction" does not? Thanks. – Kipsora Lawrence Mar 24 '19 at 01:31
  • It's predication, not prediction. A boolean predicate is created, and subsequent instructions are executed or not based on the predicate. You can read more about it in [the PTX manual](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-statements) (and search for references to "predicate"), and it applies at the machine code level, not at the C source code level. So the compiler may replace the if statement with a construct based on predication, and no branching occurs in this case. – Robert Crovella Mar 24 '19 at 01:37
  • If you [search](https://stackoverflow.com/search?q=%5Bcuda%5D+predication) here on the `cuda` tag, you will find many questions discussing predication. – Robert Crovella Mar 24 '19 at 01:38

0 Answers