Edit: I've filed this as a bug at https://developer.nvidia.com/nvidia_bug/3711214.
I'm writing a numerical simulation program that is giving subtly-incorrect results in Release mode, but seemingly correct results in Debug mode. The original program used curand for random sampling, but I've reduced it to a much simpler and more deterministic MVCE which launches a single kernel of 1 block * 1 warp (of 32 threads), where each thread:
- Performs a computation with a loop that will likely become warp-divergent, especially near the end as some threads complete their task before others.
- Syncs the threads back together.
- Attempts to butterfly-shuffle data with fellow threads in the warp to obtain a single sum.
- [not needed in the MVCE] thread 0 would write the sum back to global memory so it can be copied to the host
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
__global__ void test_kernel()
{
int cSteps = 0;
int cIters = 0;
float pos = 0;
//curandState localState = state[threadIdx.x];
while (true) {
float rn = threadIdx.x * 0.01 + 0.001;
pos += rn;
cSteps++;
if (pos > 1.0f) {
pos = 0;
cIters++;
if (cSteps > 1024) {
break;
}
}
}
printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
__syncthreads();
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32);
printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 2, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 2, 32);
printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
printf("16: Th %2d cI %d\n", threadIdx.x, cIters);
}
int main()
{
test_kernel <<<1, 32>>> ();
return 0;
}
In debug mode, the shuffle works as expected. I see each thread start out with its own value:
0: Th 0 cI 2
0: Th 1 cI 12
0: Th 2 cI 22
0: Th 3 cI 32
0: Th 4 cI 41
// ...
after the first shuffle xor 1, each pair of threads agrees on the same number:
1: Th 0 cI 14
1: Th 1 cI 14
1: Th 2 cI 54
1: Th 3 cI 54
after the shuffle xor 2, each group of four threads agrees:
2: Th 0 cI 68
2: Th 1 cI 68
2: Th 2 cI 68
2: Th 3 cI 68
2: Th 4 cI 223
2: Th 5 cI 223
2: Th 6 cI 223
2: Th 7 cI 223
and so on. After the last shuffle, all threads in the warp agree on the same value (4673).
As soon as I enable Release mode, I get results that are subtly garbage. The values entering the shuffle are the same, and the values after the first round of the shuffle agree with the debug build (and agree within each pair as before). As soon as I do a shuffle xor 2, the results fall apart:
2: Th 0 cI 28
2: Th 1 cI 28
2: Th 2 cI 108
2: Th 3 cI 108
2: Th 4 cI 186
2: Th 5 cI 186
2: Th 6 cI 260
2: Th 7 cI 260
In fact, this is the exact output that a debug build (and hand inspection) would produce if the shuffle sequence were replaced by this specific broken one:
printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
__syncthreads();
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32);
printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32); // 2 changed to 1
cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32); // 2 changed to 1
printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
The full diff of the output is here.
Hardware and software environment is as follows:
GA103 3080Ti (mobile), at manufacturer-recommended clocks, 16 G VRAM. Machine doesn't seem to be having corruption with other Cuda programs (tested with primegrid-CUDA and tasks verified against double-checks)
CUDA 11.0
MVSC host compiler 14.29.30133
Full debug command line as follows:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc142.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\kernel.cu.obj "C:\Users\[username]\source\repos\BugRepro\BugRepro\kernel.cu"
Full release command line as follows:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc142.pdb /FS /Zi /MD " -o x64\Release\kernel.cu.obj "C:\Users\[username]\source\repos\BugRepro\BugRepro\kernel.cu"
Things I tried without resolution:
- Adding/removing syncthreads calls (where one is shown, and between shuffle calls), even though they shouldn't be necessary since each shuffle synchronizes
- Changing the compute capability to 8.0 to better match my card
- Forcing base clocks on the GPU
- Shuffling in the opposite order (16/8/4/2/1)
- Using __shfl_down_sync instead of xor, with the same pattern of offsets.
Having each thread write to global memory and then summing on the host CPU does produce correct results.
Replacing all the shuffles with calls to __shfl_sync
and manually-calculated lane IDs works. Replacing just the broken shuffle xor 2 with a __shfl_sync
doesn't. Replacing just the first shuffle xor 1 (which worked correctly) with a __shfl_sync
does seem to fix it. (These two workarounds apply to my MVCE; I have not had a chance to evaluate whether they apply to the full program)
// unexpectedly working
int id = threadIdx.x;
printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
__syncthreads();
cSteps += __shfl_sync(0xffffffff, cSteps, id ^ 1, 32);
cIters += __shfl_sync(0xffffffff, cIters, id ^ 1, 32);
printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 2, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 2, 32);
printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
printf("16: Th %2d cI %d\n", threadIdx.x, cIters);
Even though I have a workaround, I'm afraid that I'm still hitting undefined behavior somewhere and my fix might be brittle.
Can anyone shed light on this? Is there indeed UB in my program? Is this a known compiler bug?