Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (unlike, say, the mere addition of instructions to an ISA with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is somewhat of a parallel to SPIR, the OpenCL-standard-related intermediate representation. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.
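As a sketch of this toolchain (assuming a CUDA toolkit installation; the file names and the `sm_70` target are illustrative), one can stop compilation at the PTX stage, then assemble and disassemble further:

```shell
# Compile a CUDA source file to PTX only, stopping before SASS generation
nvcc --ptx square.cu -o square.ptx

# Assemble the PTX into microarchitecture-specific machine code (here: Volta, sm_70)
ptxas --gpu-name=sm_70 square.ptx -o square.cubin

# Disassemble the resulting binary to inspect the SASS
nvdisasm square.cubin
```

`cuobjdump --dump-ptx` can similarly extract the PTX embedded in an already-built fat binary.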

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;
BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX ISA Reference.

164 questions
4
votes
0 answers

UNREACHABLE executed! error while trying to generate PTX

I am trying to generate PTX code for 'nbody' sample program's kernel (nbody_kernel.cu) using clang/LLVM version 3.2. The nbody CUDA program is available in Nvidia's SDK. I am referring to https://github.com/jholewinski/llvm-ptx-samples…
lucent
  • 124
  • 8
4
votes
2 answers

Cuda error CUDA_ERROR_NO_BINARY_FOR_GPU

I have some PTX code that fails to load. I'm running this on a 650M, with OSX. Other CUDA examples run fine on the system, but when loading the module I always get error 209: CUDA_ERROR_NO_BINARY_FOR_GPU What am I missing? .version 3.1 .target…
Timothy Baldridge
  • 10,455
  • 1
  • 44
  • 80
4
votes
1 answer

CUDA PTX code %envreg<32> special registers

I tried to run a PTX assembly code generated by a .cl kernel with the CUDA driver API. The steps I took were these (standard OpenCL procedure): 1) Load .cl kernel 2) JIT compile it 3) Get the compiled ptx code and save it. So far so good. I…
mpekatsoula
  • 45
  • 1
  • 5
4
votes
1 answer

Natural logarithm implementation when only logarithm base 2 is available

I am trying to implement the natural logarithm with PTX. PTX natively only provides lg2.approx.f32 which implements the logarithm to base 2. Thus, applying simple maths one can get the natural logarithm by just multiplying the logarithm to base 2…
ritter
  • 7,447
  • 7
  • 51
  • 84
3
votes
1 answer

Differences between NVCC and NVRTC on compilation to PTX

Summary I'm porting a simple raytracing application based on the Scratchapixel version to a bunch of GPU libraries. I successfully ported it to CUDA using the runtime API and the driver API, but it throws a Segmentation fault (core dumped) when I try…
Dinei
  • 4,494
  • 4
  • 36
  • 60
3
votes
1 answer

Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

In CUDA 9, nVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (= CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias or have the semantics changed? ... similar…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
3
votes
1 answer

How can I utilize the 'red' and 'atom' PTX instructions in CUDA C++ code?

The CUDA PTX Guide describes the instructions 'atom' and 'red', which perform atomic and non-atomic reductions. This is news to me (at least with respect to non-atomic reductions)... I remember learning how to do reductions with SHFL a while back.…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
3
votes
1 answer

How to understand the result of SASS analysis in CUDA/GPU

I used cuobjdump, one of the CUDA Binary Utilities, to generate the SASS code, sample results are as below. These codes try to load from global memory. /*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20]; /*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24];…
Steven Huang
  • 153
  • 1
  • 13
3
votes
1 answer

Apparently redundant operations in disassembled CUDA microcode

I have the following kernel performing a simple assignment of a global memory matrix in to a global memory matrix out: __global__ void simple_copy(float *outdata, const float *indata){ int x = blockIdx.x * TILE_DIM + threadIdx.x; int y =…
Vitality
  • 20,705
  • 4
  • 108
  • 146
3
votes
2 answers

PTX variable length buffer in shared memory

I am trying to implement a global reduction kernel in PTX which uses shared memory for reduction within a thread block (like all the CUDA C examples out there). In CUDA C one has the possibility to define a variable-length array in shared memory…
ritter
  • 7,447
  • 7
  • 51
  • 84
2
votes
0 answers

How to compare AT&T-assembly-like sources (e.g. CUDA PTX)?

I want to compare two pieces of low-level code, each in its own file. The format is AT&T-Assembly-style: For me, it's two CUDA PTX files, but this question applies also for the output of gcc -S or clang -S. For the sake of discussion assume that…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
2
votes
1 answer

Are load and store operations in shared memory atomic?

I'm trying to figure out whether load and store operations on primitive types are atomics when we load/store from shared memory in CUDA. On the one hand, it seems that any load/store is compiled to the PTX instruction ld.weak.shared.cta which does…
Pierre T.
  • 380
  • 1
  • 13
2
votes
0 answers

Get the PTX dump when running TensorRT

I am running an ONNX model through TensorRT. I can verify that inference is running on the GPU through the results and nvsys profile logs. However, I would like to see the corresponding PTX binary that TensorRT generates for my input model. Is there…
mikepapadim
  • 443
  • 2
  • 14
2
votes
1 answer

Understanding the parameters of PTX instruction mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32

How to understand the parameters in the following snippet of CUDA inline assembly code? ...... asm volatile( \ "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 \n" \ " {%0, %1, %2, %3}, \n" \ " …
inprocess
  • 31
  • 3
2
votes
1 answer

CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error

I'm using inline PTX ld.shared to load data from shared memory: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory float Csub = 0; As[TY][TX] = A[a + wA * TY + TX]; //load data from global memory to shared…
Yichen
  • 91
  • 1
  • 5