I am trying to implement a global reduction kernel in PTX which uses shared memory for reduction within a thread block (like all the CUDA C examples out there). In CUDA C on has the possibility to define an variable length array in shared memory with
extern __shared__ float sdata[];
How can I get the equivalent in PTX ?
What doesn't seem appropriate is a fixed length array like
.shared .f32 sdata[ LENGTH ];
Since I want the kernel to be reusable for different input array lengths.
What I could do is define one variable
.shared .f32 sdata;
and use it as the base address of the array. In the hope that it is allocated at the beginning of shared memory. I could then access array element like
ld.shared.f32 %r4,[sdata + <offset>]
Also this looks a bit funny because sdata
is defined as a float
. But what it really is is the address of a float. In this sense the above line is indeed correct.
However I am not sure if this is guaranteed to be correct, say as long as the offset is not greater than the shared memory size specified at kernel launch.
The PTX manual doesn't explain variable length buffers in shared memory.
Anyone knows how to implement a variable length buffer in PTX?