I followed the warp divergence examples (the code below) from the textbook "Professional CUDA C Programming".
__global__ void math_kernel1(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    a = b = 0.f;

    // Branches on even/odd thread index: threads within the same warp
    // take different paths, so this branch should diverge.
    if (tid % 2 == 0) {
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}

__global__ void math_kernel2(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    a = b = 0.f;

    // Branches at warp granularity: all threads of a warp take the same
    // path, so there should be no divergence.
    if ((tid / warpSize) % 2 == 0) {
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}

__global__ void math_kernel3(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    bool ipred = (tid % 2 == 0);
    a = b = 0.f;

    // Same even/odd condition as math_kernel1, but written as two
    // independent if statements.
    if (ipred) {
        a = 100.0f;
    }
    if (!ipred) {
        b = 200.0f;
    }
    c[tid] = a + b;
}
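
To run them, I use a minimal host driver along these lines. This is my own sketch, not the textbook's full example: the array size, block size, and the file name divergence.cu are illustrative, and the three kernels above are assumed to live in the same .cu file.

// Build:   nvcc -g -G -o divergence divergence.cu
// Profile: nvprof --metrics branch_efficiency ./divergence
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const int n = 64 * 1024;            // number of elements (illustrative)
    const int blockSize = 128;          // threads per block (illustrative)
    const int gridSize = (n + blockSize - 1) / blockSize;

    float *d_c = NULL;
    cudaMalloc((void **)&d_c, n * sizeof(float));

    // Launch each kernel once so nvprof reports branch efficiency per kernel.
    math_kernel1<<<gridSize, blockSize>>>(d_c);
    cudaDeviceSynchronize();
    math_kernel2<<<gridSize, blockSize>>>(d_c);
    cudaDeviceSynchronize();
    math_kernel3<<<gridSize, blockSize>>>(d_c);
    cudaDeviceSynchronize();

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_c);
    cudaDeviceReset();                  // flush profiler data
    return 0;
}
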
Both intuitively and according to the textbook, math_kernel2 should have the best branch efficiency, math_kernel1 should come next, and math_kernel3 should have the worst. However, nvprof gives me results that contradict the textbook. I benchmarked these kernels on a GTX 1080 Ti with CUDA 8.0 (passing the -g -G flags to nvcc to disable optimizations), and it reports the following branch efficiency:
- math_kernel1 83.33%
- math_kernel2 100.00%
- math_kernel3 100.00% (expected to be lower than math_kernel1; the textbook reports 71.43%)
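
If I read the book correctly, branch efficiency is defined as the ratio of non-divergent branches to total branches:

Branch Efficiency = 100 * (#Branches - #Divergent Branches) / #Branches

so the 100% figures mean nvprof counted no divergent branches at all for math_kernel2 and math_kernel3 on my setup.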