I'm using the inline PTX instruction ld.shared to load data from shared memory:
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory
float Csub = 0;
As[TY][TX] = A[a + wA * TY + TX]; //load data from global memory to shared memory
__syncthreads();
float t;
asm("ld.shared.f32 %0, [%1];" :"=f"(t) : "r"((int)&As[TY][k])); //load data from shared memory into t
Csub += t;
__syncthreads();
But it runs into an error:
CUDA error at C:/ProgramData/NVIDIA Corporation/CUDA Samples/v11.2/0_Simple/matrixMul_mine/matrixMul.cu:196 code=700(cudaErrorIllegalAddress) "cudaStreamSynchronize(stream)"
I dumped the SASS and found that the LDS is issued even earlier than the LDG and the two bar.sync instructions. It seems that the compiler loses track of the data dependency.
So my questions are:
- Is there anything wrong in my inline PTX that leads to cudaErrorIllegalAddress?
- Does inline PTX disturb the compiler's ability to track data dependencies?
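For reference, here is a minimal sketch of a variant that touches on both questions. It rests on two assumptions that I have not confirmed against this exact build: that ld.shared needs a shared-state-space address rather than the generic pointer that &As[TY][k] yields (and that casting that pointer to int truncates it on a 64-bit target), and that an asm statement without volatile and a "memory" clobber may be scheduled freely by the compiler. The helper name load_shared_f32 is hypothetical; __cvta_generic_to_shared is the CUDA 11.0+ intrinsic for the address conversion.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper, not part of the original kernel.
__device__ __forceinline__ float load_shared_f32(const float *ptr)
{
    // Convert the generic pointer to a shared-state-space address.
    // Shared-space addresses fit in 32 bits, so the result can be
    // passed through an "r" (32-bit register) operand.
    unsigned int addr = static_cast<unsigned int>(__cvta_generic_to_shared(ptr));

    float value;
    // volatile plus the "memory" clobber tells the compiler that this
    // statement reads memory, so it should not be hoisted above the
    // preceding store to shared memory or the barrier.
    asm volatile("ld.shared.f32 %0, [%1];"
                 : "=f"(value)
                 : "r"(addr)
                 : "memory");
    return value;
}

// Intended use inside the kernel from the question (As, TY, k as defined there):
//   __syncthreads();
//   Csub += load_shared_f32(&As[TY][k]);
//   __syncthreads();
```

If the illegal address disappears with the converted address alone, the pointer cast was the cause of the crash; if the LDS still appears ahead of the barriers in the SASS without volatile and the clobber, that points to the reordering issue.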