3

I am trying to implement a global reduction kernel in PTX which uses shared memory for reduction within a thread block (like all the CUDA C examples out there). In CUDA C on has the possibility to define an variable length array in shared memory with

extern __shared__ float sdata[];

How can I get the equivalent in PTX ?

What doesn't seem appropriate is a fixed length array like

.shared .f32 sdata[ LENGTH ];

Since I want the kernel to be reusable for different input array lengths.

What I could do is define one variable

.shared .f32 sdata;

and use it as the base address of the array. In the hope that it is allocated at the beginning of shared memory. I could then access array element like

ld.shared.f32 %r4,[sdata + <offset>]

Also this looks a bit funny because sdata is defined as a float. But what it really is is the address of a float. In this sense the above line is indeed correct. However I am not sure if this is guaranteed to be correct, say as long as the offset is not greater than the shared memory size specified at kernel launch.

The PTX manual doesn't explain variable length buffers in shared memory.

Anyone knows how to implement a variable length buffer in PTX?

paleonix
  • 2,293
  • 1
  • 13
  • 29
ritter
  • 7,447
  • 7
  • 51
  • 84
  • 4
    I haven't done this myself, but it seems that you could create a simple cuda C example that does what you want, then compile and keep the ptx output (nvcc -ptx ...) to see what it looks like in ptx code. – Robert Crovella Oct 26 '12 at 15:35
  • 2
    Like the idea! I did what you suggested. Turns out there is an undocumented `sdata` constant pointer which marks the start of shared memory. – ritter Oct 26 '12 at 16:22

2 Answers2

2

This works. However it is not the perfect solution because it introduces an extern linkage variable.

.version 2.3
.target sm_20
.extern .shared .align 4 .b8 sdata[];
.entry func (.param .s32 param0,...)
{
 //
 // Base addresses
 mov.u64 w2,sdata;  // shared memory
 ld.shared.s32 i9,[w2+0];
}
ritter
  • 7,447
  • 7
  • 51
  • 84
-2

In CUDA C on has the possibility to define an variable length array in shared memory with

extern __shared__ float sdata[];

This isn't a variable-length array in the usual sense of that term - it's just syntax for accessing a dynamically-limited amount of shared memory, which is set during kernel launch.

The fact that the CUDA compiler introduces an .extern definition is, TBH, an unfortunate implementation detail - which nVIDIA unfortunately exposed as part of the CUDA syntax.

einpoklum
  • 118,144
  • 57
  • 340
  • 684