2

I'm using inline PTX ld.shared to load data from shared memory:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];  //declare a buffer in shared memory
float Csub = 0;

As[TY][TX] = A[a + wA * TY + TX];             //load data from global memory to shared memory
__syncthreads();
float t;
asm("ld.shared.f32 %0, [%1];" :"=f"(t) : "r"((int)&As[TY][k]));  //load data from shared memory into t
Csub += t;
__syncthreads();

But it runs into an error:

CUDA error at C:/ProgramData/NVIDIA Corporation/CUDA Samples/v11.2/0_Simple/matrixMul_mine/matrixMul.cu:196 code=700(cudaErrorIllegalAddress) "cudaStreamSynchronize(stream)" 

I dumped the SASS and found that the LDS happens even earlier than LDG and the two bar.sync instructions. It seems that the compiler looses track of the data dependency.

So my questions are:

  1. Is there anything wrong in my inline PTX that leads to cudaErrorIllegalAddress?
  2. Does inline PTX disturb the compilers ability to track data dependencies?
paleonix
  • 2,293
  • 1
  • 13
  • 29
Yichen
  • 91
  • 1
  • 5
  • 1
    https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#incorrect-optimization – talonmies Nov 12 '21 at 04:47
  • 1
    Perhaps you also have to declare As as volatile in addition to talonmies answer, the illegal address probably is independent from the order - but we cannot see where your parameters a, wA, TY, TK, k are coming from, how large the global array is or with what block and grid size your kernel is called. – Sebastian Nov 12 '21 at 05:45
  • 2
    Just learnt that the address need to be converted by "cvta" like [link](https://github.com/NVIDIA/cutlass/blob/6fc5008803fe4e81b81a836fcd3a88258f4e5bbf/include/cutlass/arch/memory_sm75.h#L90) in cutlass. And as Sebastian said, 'volatile' should be added right after 'asm'. – Yichen Nov 12 '21 at 09:10

1 Answers1

4

Yichen's comment is right.

There are two types of addressing: ld. or ld.statespace.

If ld. is used on its own, the address should be a generic address. The generic address, to my limited understanding, is the CUDA-C++ pointer value, like &As[TY][k] in your code.

If ld.statespace is used, the address should be the address in the state space.

I think if you use ld.f32 instead of ld.shared.f32, your code should be okay. BTW, I don't think you can use the generic address in 32-bit data width, which can truncate the generic address into a wrong value.

Or you can convert the generic address to the shared space address. Here is CUTLASS' conversion code:

      ".reg .u32 smem_ptr32;\n\t"
      ".reg .u64 smem_ptr64; cvta.to.shared.u64 smem_ptr64, %1; cvt.u32.u64 smem_ptr32, smem_ptr64; \n\t"

Then use smem_ptr32 instead of [%1]:

"ld.shared.f32 %0, [smem_ptr32];"

As the PTX ISA says, this address can be either 32-bit or 64-bit. I think it's not necessary to convert the 64-bit pointer to a 32-bit pointer. Using smem_ptr64 should be okay.

Here is what the shared memory address could look like:

Pointer State Space Value
CUDA-C++ pointer Generic Space 1526743433216 + 1024
smem_ptr64 Shared Space 0 + 1024
paleonix
  • 2,293
  • 1
  • 13
  • 29
Mr.Ly
  • 153
  • 10