Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing micro-architectures, and the changes between them are not always incremental (unlike, say, a CPU instruction set that mostly just gains instructions with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is roughly a parallel to SPIR, the intermediate representation associated with the OpenCL standard. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX ISA Reference.
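
PTX of this kind is rarely written by hand; it is typically emitted by nvcc (for example via its -ptx option) or produced implicitly from inline assembly. As a minimal sketch (not part of the kernel above, identifiers chosen for illustration), the same mad.lo.s32 instruction the compiler emitted for the index computation can be written directly from CUDA C++:

__device__ int global_index() {
    int pos;
    // pos = blockDim.x * blockIdx.x + threadIdx.x, expressed as a single
    // PTX multiply-add, mirroring the mad.lo.s32 line in the listing above.
    asm("mad.lo.s32 %0, %1, %2, %3;"
        : "=r"(pos)
        : "r"((int)blockDim.x), "r"((int)blockIdx.x), "r"((int)threadIdx.x));
    return pos;
}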

164 questions
2
votes
1 answer

What does it mean when a variable "has been demoted" in the PTX?

In the function body of my CUDA kernel, I have a few __shared__ array variables, of a fixed size. When I look at the compiled PTX code (SM 7.5) for one of these arrays, I see a comment saying: // my_kernel(t1 p1, t2 p2)::my_variable has been…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
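
A minimal sketch of the situation this question describes, assuming a kernel with a fixed-size __shared__ array in its body (names here are hypothetical): the compiler may hoist such an array to module scope in the generated PTX and annotate it with a "has been demoted" comment.

__global__ void my_kernel(float *out) {
    // Fixed-size shared array declared at function scope; in the PTX it can
    // show up as a module-scope .shared declaration carrying the comment
    // the question asks about.
    __shared__ float my_variable[256];
    my_variable[threadIdx.x] = out[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = my_variable[255 - threadIdx.x];
}
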
2
votes
1 answer

How to compile CUDA code that calls one function twice inside one method?

I am trying to compile this piece of code: struct foo { unsigned long long x0; }; //__device__ __noinline__ foo bar(foo a, foo b){ // << try this __device__ foo bar(foo a, foo b){ foo r; asm( ".reg .u64 my_cool_var;\n\t" …
chabapok
  • 921
  • 1
  • 8
  • 14
2
votes
1 answer

In asm volatile inline PTX instructions, why also specify "memory" side effects?

Consider the following excerpt from CUDA's Inline PTX Assembly guide (v10.2): The compiler assumes that an asm() statement has no side effects except to change the output operands. To ensure that the asm is not deleted or moved during generation…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
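
A hedged sketch of the kind of statement the quoted passage is about (identifiers are illustrative): an inline PTX store whose only side effect is on memory rather than on an output operand, so volatile plus the "memory" clobber is what keeps the compiler from deleting or reordering it.

__device__ void store_flag(int *p, int v) {
    // No output operands here: without volatile and the "memory" clobber the
    // compiler could treat this asm() as having no effect. p is assumed to
    // point to global memory.
    asm volatile("st.global.s32 [%0], %1;"
                 :
                 : "l"(p), "r"(v)
                 : "memory");
}
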
2
votes
1 answer

LLVM IR of OpenCL kernel to PTX to binary

I am using clang to generate LLVM IR for Nvidia OpenCL and Cuda kernels, which I want to subsequently instrument, doing something like this for OpenCL: clang -c -x cl -S -emit-llvm -cl-std=CL2.0 kernel.cl -o kernel.ll and what's described here for…
0x6K5
  • 57
  • 6
2
votes
2 answers

How to pass compiler flags to nvcc from clang

I am trying to compile CUDA with clang, but the code I am trying to compile depends on a specific nvcc flag (-default-stream per-thread). How can I tell clang to pass the flag to nvcc? For example, I can compile with nvcc and everything works…
Increasingly Idiotic
  • 5,700
  • 5
  • 35
  • 73
2
votes
1 answer

Understanding cuobjdump output

I already read about virtual architecture and code generation for nvcc but I still have some questions. I have a CUDA-compiled executable whose cuobjdump output is: Fatbin elf code: ================ arch = sm_20 code version = [1,7] producer =…
Dean
  • 6,610
  • 6
  • 40
  • 90
2
votes
0 answers

How can I use NVIDIA's PTX code to draw graphics on the screen?

Suppose I have a bare bones bootloader running some x86 instructions, and I want to be able to draw some graphics on the screen, without using legacy options like system interrupts. How can I write PTX code to achieve this, and how can I actually…
MazeOfEncryption
  • 365
  • 3
  • 13
2
votes
2 answers

How do I do the converse of shfl.idx (i.e. warp scatter instead of warp gather)?

With CUDA's shfl.idx instruction, we perform what is essentially an intra-warp gather: Each lane provides a datum and an origin lane, and gets the datum of the origin lane. What about the converse operation, scatter? I mean, not scattering to…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
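
For reference, the gather half of the question maps directly onto the shuffle intrinsic; a minimal sketch assuming full-warp participation (the converse scatter, which the question asks about, has no single shfl equivalent):

__device__ int warp_gather(int my_value, int src_lane) {
    // Every lane contributes my_value and names src_lane; each lane then
    // receives the value contributed by the lane it named -- a gather.
    return __shfl_sync(0xFFFFFFFFu, my_value, src_lane);
}
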
2
votes
2 answers

How do I get a PTX file to execute

I know how to generate a .ptx file from a .cu and how to generate a .cubin file from a .ptx. But I don't know how to get the final executable. More specifically, I have a sample.cu file, which is compiled to sample.ptx. I then use nvcc to compile…
dalibocai
  • 2,289
  • 5
  • 29
  • 45
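
One common route to executing a stand-alone .ptx file is the CUDA driver API. A hedged host-side sketch, assuming the kernel in sample.ptx was declared extern "C" so that its entry point is literally named "square" (error checking omitted):

#include <cuda.h>   // CUDA driver API

void run_square(CUdeviceptr d_array, int length) {
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load the PTX file and look up the (unmangled) entry point.
    cuModuleLoad(&mod, "sample.ptx");
    cuModuleGetFunction(&fn, mod, "square");

    // Launch with one thread per element, as in the tag-wiki kernel.
    void *args[] = { &d_array, &length };
    int block = 256;
    int grid  = (length + block - 1) / block;
    cuLaunchKernel(fn, grid, 1, 1, block, 1, 1, 0, 0, args, 0);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
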
2
votes
1 answer

What does thread-count mean for bar.arrive PTX barrier synchronization instruction?

It is mentioned here in the PTX documentation that bar.sync and bar.arrive barrier synchronization instructions can be used as below: bar.sync a{, b}; bar.arrive a, b; Where Source operand a specifies a logical barrier resource as an…
Farzad
  • 3,288
  • 2
  • 29
  • 53
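
A hedged sketch of the producer/consumer pattern in which bar.arrive typically appears (barrier number and thread count are illustrative): the thread-count operand names the total number of threads participating in that named barrier, here assumed to be 128 producers plus 128 consumers.

__device__ void producer_signal() {
    // Producers mark arrival at named barrier 1 and continue without waiting;
    // 256 is the assumed total number of participating threads.
    asm volatile("bar.arrive 1, 256;");
}

__device__ void consumer_wait() {
    // Consumers block at the same barrier until all 256 participants
    // (producers via bar.arrive, consumers via bar.sync) have arrived.
    asm volatile("bar.sync 1, 256;");
}
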
2
votes
2 answers

CUDA - PTX carry propagation

I want to add two 32-bit unsigned integers in CUDA PTX and I also want to take care of the carry propagation. I am using the code below to do that, but the result is not as expected. According to the documentation, the add.cc.u32 d, a, b performs…
Dani Grosu
  • 544
  • 1
  • 4
  • 22
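
A minimal sketch of the carry-propagating form, assuming each 64-bit unsigned value is held as two 32-bit halves (lo, hi): add.cc.u32 sets the carry flag and addc.u32 consumes it.

__device__ void add_u64_as_u32_pairs(unsigned &lo, unsigned &hi,
                                     unsigned alo, unsigned ahi,
                                     unsigned blo, unsigned bhi) {
    // Low halves first, writing the carry-out; the carry is then added into
    // the sum of the high halves.
    asm("add.cc.u32 %0, %2, %4;\n\t"
        "addc.u32   %1, %3, %5;"
        : "=r"(lo), "=r"(hi)
        : "r"(alo), "r"(ahi), "r"(blo), "r"(bhi));
}
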
2
votes
0 answers

"Redundant" Move Operations to same Register in Cuda

I was viewing the CUDA SASS code, and I noticed a large amount of move operations to the same registers. Ex: 172 MOV R3, R3; 173 MOV R4, R4; 174 MOV R3, R3; 175 MOV R4, R4; 176 MOV R4, R4; 177 …
Dane Bouchie
  • 421
  • 5
  • 11
2
votes
0 answers

How to prevent FTZ for a single line in CUDA

I am working on a particle code where flushing-to-zero is extensively used to extract performance. However there is a single floating point comparison statement that I do not wish to be flushed. One solution is to use inline PTX, but it introduces…
Rainn
  • 315
  • 1
  • 9
2
votes
1 answer

Why doesn't OpenCL Nvidia compiler (nvcc) use the registers twice?

I'm doing a small OpenCL benchmark using Nvidia drivers; my kernel performs 1024 fused multiply-adds and stores the result in an array: #define FLOPS_MACRO_1(x) { (x) = (x) * 0.99f + 10.f; } // Multiply-add #define FLOPS_MACRO_2(x) {…
GaTTaCa
  • 459
  • 6
  • 18
2
votes
1 answer

Linking a kernel to a PTX function

Can I use a PTX function contained in a PTX file as an external device function to link it to another .cu file which should call that function? This is another question from CUDA - link kernels together where the function itself is not contained in…
Marco A.
  • 43,032
  • 26
  • 132
  • 246