Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing micro-architectures, and the changes between them are not always incremental (unlike, say, a CPU instruction set that mostly just gains instructions with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is roughly a parallel to SPIR, the intermediate representation associated with the OpenCL standard. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX ISA Reference.
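
PTX of this kind is rarely written by hand; it is typically emitted by nvcc (for example via its -ptx option) or produced implicitly from inline assembly. As a minimal sketch (not part of the kernel above, identifiers chosen for illustration), the same mad.lo.s32 instruction the compiler emitted for the index computation can be written directly from CUDA C++:

__device__ int global_index() {
    int pos;
    // pos = blockDim.x * blockIdx.x + threadIdx.x, expressed as a single
    // PTX multiply-add, mirroring the mad.lo.s32 line in the listing above.
    asm("mad.lo.s32 %0, %1, %2, %3;"
        : "=r"(pos)
        : "r"((int)blockDim.x), "r"((int)blockIdx.x), "r"((int)threadIdx.x));
    return pos;
}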

164 questions
2
votes
1 answer

What does it mean when a variable "has been demoted" in the PTX?

In the function body of my CUDA kernel, I have a few __shared__ array variables, of a fixed size. When I look at the compiled PTX code (SM 7.5) for one of these arrays, I see a comment saying: // my_kernel(t1 p1, t2 p2)::my_variable has been…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
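
A minimal sketch of the situation this question describes, assuming a kernel with a fixed-size __shared__ array in its body (names here are hypothetical): the compiler may hoist such an array to module scope in the generated PTX and annotate it with a "has been demoted" comment.

__global__ void my_kernel(float *out) {
    // Fixed-size shared array declared at function scope; in the PTX it can
    // show up as a module-scope .shared declaration carrying the comment
    // the question asks about.
    __shared__ float my_variable[256];
    my_variable[threadIdx.x] = out[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = my_variable[255 - threadIdx.x];
}
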
2
votes
1 answer

How to compile CUDA code that calls one function twice inside one method?

I am trying to compile this piece of code: struct foo { unsigned long long x0; }; //__device__ __noinline__ foo bar(foo a, foo b){ // << try this __device__ foo bar(foo a, foo b){ foo r; asm( ".reg .u64 my_cool_var;\n\t" …
chabapok
  • 921
  • 1
  • 8
  • 14
2
votes
1 answer

In asm volatile inline PTX instructions, why also specify "memory" side effects?

Consider the following excerpt from CUDA's Inline PTX Assembly guide (v10.2): The compiler assumes that an asm() statement has no side effects except to change the output operands. To ensure that the asm is not deleted or moved during generation…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
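
A hedged sketch of the kind of statement the quoted passage is about (identifiers are illustrative): an inline PTX store whose only side effect is on memory rather than on an output operand, so volatile plus the "memory" clobber is what keeps the compiler from deleting or reordering it.

__device__ void store_flag(int *p, int v) {
    // No output operands here: without volatile and the "memory" clobber the
    // compiler could treat this asm() as having no effect. p is assumed to
    // point to global memory.
    asm volatile("st.global.s32 [%0], %1;"
                 :
                 : "l"(p), "r"(v)
                 : "memory");
}
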
2
votes
1 answer

LLVM IR of OpenCL kernel to PTX to binary

I am using clang to generate LLVM IR for Nvidia OpenCL and Cuda kernels, which I want to subsequently instrument, doing something like this for OpenCL: clang -c -x cl -S -emit-llvm -cl-std=CL2.0 kernel.cl -o kernel.ll and what's described here for…
0x6K5
  • 57
  • 6
2
votes
2 answers

How to pass compiler flags to nvcc from clang

I am trying to compile CUDA with clang, but the code I am trying to compile depends on a specific nvcc flag (-default-stream per-thread). How can I tell clang to pass the flag to nvcc? For example, I can compile with nvcc and everything works…
Increasingly Idiotic
  • 5,700
  • 5
  • 35
  • 73
2
votes
1 answer

Understanding cuobjdump output

I already read about virtual architecture and code generation for nvcc but I still have some questions. I have a CUDA-compiled executable whose cuobjdump output is: Fatbin elf code: ================ arch = sm_20 code version = [1,7] producer =…
Dean
  • 6,610
  • 6
  • 40
  • 90
2
votes
0 answers

How can I use NVIDIA's PTX code to draw graphics on the screen?

Suppose I have a bare bones bootloader running some x86 instructions, and I want to be able to draw some graphics on the screen, without using legacy options like system interrupts. How can I write PTX code to achieve this, and how can I actually…
MazeOfEncryption
  • 365
  • 3
  • 13
2
votes
2 answers

How do I do the converse of shfl.idx (i.e. warp scatter instead of warp gather)?

With CUDA's shfl.idx instruction, we perform what is essentially an intra-warp gather: Each lane provides a datum and an origin lane, and gets the datum of the origin lane. What about the converse operation, scatter? I mean, not scattering to…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
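
For reference, the gather half of the question maps directly onto the shuffle intrinsic; a minimal sketch assuming full-warp participation (the converse scatter, which the question asks about, has no single shfl equivalent):

__device__ int warp_gather(int my_value, int src_lane) {
    // Every lane contributes my_value and names src_lane; each lane then
    // receives the value contributed by the lane it named -- a gather.
    return __shfl_sync(0xFFFFFFFFu, my_value, src_lane);
}
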
2
votes
2 answers

How do I get a PTX file to execute

I know how to generate a .ptx file from a .cu and how to generate a .cubin file from a .ptx. But I don't know how to get the final executable. More specifically, I have a sample.cu file, which is compiled to sample.ptx. I then use nvcc to compile…
dalibocai
  • 2,289
  • 5
  • 29
  • 45
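
One common route to executing a stand-alone .ptx file is the CUDA driver API. A hedged host-side sketch, assuming the kernel in sample.ptx was declared extern "C" so that its entry point is literally named "square" (error checking omitted):

#include <cuda.h>   // CUDA driver API

void run_square(CUdeviceptr d_array, int length) {
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load the PTX file and look up the (unmangled) entry point.
    cuModuleLoad(&mod, "sample.ptx");
    cuModuleGetFunction(&fn, mod, "square");

    // Launch with one thread per element, as in the tag-wiki kernel.
    void *args[] = { &d_array, &length };
    int block = 256;
    int grid  = (length + block - 1) / block;
    cuLaunchKernel(fn, grid, 1, 1, block, 1, 1, 0, 0, args, 0);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
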
2
votes
1 answer

What does thread-count mean for bar.arrive PTX barrier synchronization instruction?

It is mentioned here in the PTX documentation that bar.sync and bar.arrive barrier synchronization instructions can be used as below: bar.sync a{, b}; bar.arrive a, b; Where Source operand a specifies a logical barrier resource as an…
Farzad
  • 3,288
  • 2
  • 29
  • 53
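
A hedged sketch of the producer/consumer pattern in which bar.arrive typically appears (barrier number and thread count are illustrative): the thread-count operand names the total number of threads participating in that named barrier, here assumed to be 128 producers plus 128 consumers.

__device__ void producer_signal() {
    // Producers mark arrival at named barrier 1 and continue without waiting;
    // 256 is the assumed total number of participating threads.
    asm volatile("bar.arrive 1, 256;");
}

__device__ void consumer_wait() {
    // Consumers block at the same barrier until all 256 participants
    // (producers via bar.arrive, consumers via bar.sync) have arrived.
    asm volatile("bar.sync 1, 256;");
}
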
2
votes
2 answers

CUDA - PTX carry propagation

I want to add two 32-bit unsigned integers in CUDA PTX and I also want to take care of the carry propagation. I am using the code below to do that, but the result is not as expected. According to the documentation, the add.cc.u32 d, a, b performs…
Dani Grosu
  • 544
  • 1
  • 4
  • 22
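
A minimal sketch of the carry-propagating form, assuming each 64-bit unsigned value is held as two 32-bit halves (lo, hi): add.cc.u32 sets the carry flag and addc.u32 consumes it.

__device__ void add_u64_as_u32_pairs(unsigned &lo, unsigned &hi,
                                     unsigned alo, unsigned ahi,
                                     unsigned blo, unsigned bhi) {
    // Low halves first, writing the carry-out; the carry is then added into
    // the sum of the high halves.
    asm("add.cc.u32 %0, %2, %4;\n\t"
        "addc.u32   %1, %3, %5;"
        : "=r"(lo), "=r"(hi)
        : "r"(alo), "r"(ahi), "r"(blo), "r"(bhi));
}
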
2
votes
0 answers

"Redundant" Move Operations to same Register in Cuda

I was viewing the CUDA SASS code, and I noticed a large amount of move operations to the same registers. Ex: 172 MOV R3, R3; 173 MOV R4, R4; 174 MOV R3, R3; 175 MOV R4, R4; 176 MOV R4, R4; 177 …
Dane Bouchie
  • 421
  • 5
  • 11
2
votes
0 answers

How to prevent FTZ for a single line in CUDA

I am working on a particle code where flushing-to-zero is extensively used to extract performance. However there is a single floating point comparison statement that I do not wish to be flushed. One solution is to use inline PTX, but it introduces…
Rainn
  • 315
  • 1
  • 9
2
votes
1 answer

Why doesn't OpenCL Nvidia compiler (nvcc) use the registers twice?

I'm doing a small OpenCL benchmark using Nvidia drivers; my kernel performs 1024 fused multiply-adds and stores the result in an array: #define FLOPS_MACRO_1(x) { (x) = (x) * 0.99f + 10.f; } // Multiply-add #define FLOPS_MACRO_2(x) {…
GaTTaCa
  • 459
  • 6
  • 18
2
votes
1 answer

Linking a kernel to a PTX function

Can I use a PTX function contained in a PTX file as an external device function to link it to another .cu file which should call that function? This is another question from CUDA - link kernels together where the function itself is not contained in…
Marco A.
  • 43,032
  • 26
  • 132
  • 246