Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (unlike, say, the mere addition of instructions to an ISA with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is somewhat of a parallel to SPIR, the OpenCL-standard-related intermediate representation. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.
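As a sketch of this toolchain (assuming a CUDA toolkit installation; the file names and the `sm_70` target are illustrative), one can stop compilation at the PTX stage, then assemble and disassemble further:

```shell
# Compile a CUDA source file to PTX only, stopping before SASS generation
nvcc --ptx square.cu -o square.ptx

# Assemble the PTX into microarchitecture-specific machine code (here: Volta, sm_70)
ptxas --gpu-name=sm_70 square.ptx -o square.cubin

# Disassemble the resulting binary to inspect the SASS
nvdisasm square.cubin
```

`cuobjdump --dump-ptx` can similarly extract the PTX embedded in an already-built fat binary.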

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;
BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX ISA Reference.

164 questions
4
votes
0 answers

UNREACHABLE executed! error while trying to generate PTX

I am trying to generate PTX code for 'nbody' sample program's kernel (nbody_kernel.cu) using clang/LLVM version 3.2. The nbody CUDA program is available in Nvidia's SDK. I am referring to https://github.com/jholewinski/llvm-ptx-samples…
lucent
  • 124
  • 8
4
votes
2 answers

Cuda error CUDA_ERROR_NO_BINARY_FOR_GPU

I have some PTX code that fails to load. I'm running this on a 650M, with OSX. Other CUDA examples run fine on the system, but when loading the module I always get error 209: CUDA_ERROR_NO_BINARY_FOR_GPU What am I missing? .version 3.1 .target…
Timothy Baldridge
  • 10,455
  • 1
  • 44
  • 80
4
votes
1 answer

CUDA PTX code %envreg<32> special registers

I tried to run a PTX assembly code generated by a .cl kernel with the CUDA driver API. The steps I took were these (standard OpenCL procedure): 1) Load .cl kernel 2) JIT compile it 3) Get the compiled ptx code and save it. So far so good. I…
mpekatsoula
  • 45
  • 1
  • 5
4
votes
1 answer

Natural logarithm implementation when only logarithm base 2 is available

I am trying to implement the natural logarithm with PTX. PTX natively only provides lg2.approx.f32 which implements the logarithm to base 2. Thus, applying simple maths one can get the natural logarithm by just multiplying the logarithm to base 2…
ritter
  • 7,447
  • 7
  • 51
  • 84
3
votes
1 answer

Differences between NVCC and NVRTC on compilation to PTX

Summary I'm porting a simple raytracing application based on the Scratchapixel version to a bunch of GPU libraries. I successfully ported it to CUDA using the runtime API and the driver API, but it throws a Segmentation fault (core dumped) when I try…
Dinei
  • 4,494
  • 4
  • 36
  • 60
3
votes
1 answer

Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

In CUDA 9, nVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (= CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias or have the semantics changed? ... similar…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
3
votes
1 answer

How can I utilize the 'red' and 'atom' PTX instructions in CUDA C++ code?

The CUDA PTX Guide describes the instructions 'atom' and 'red', which perform atomic and non-atomic reductions. This is news to me (at least with respect to non-atomic reductions)... I remember learning how to do reductions with SHFL a while back.…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
3
votes
1 answer

How to understand the result of SASS analysis in CUDA/GPU

I used cuobjdump, one of the CUDA Binary Utilities, to generate the SASS code, sample results are as below. These codes try to load from global memory. /*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20]; /*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24];…
Steven Huang
  • 153
  • 1
  • 13
3
votes
1 answer

Apparently redundant operations in disassembled CUDA microcode

I have the following kernel performing a simple assignment of a global memory matrix in to a global memory matrix out: __global__ void simple_copy(float *outdata, const float *indata){ int x = blockIdx.x * TILE_DIM + threadIdx.x; int y =…
Vitality
  • 20,705
  • 4
  • 108
  • 146
3
votes
2 answers

PTX variable length buffer in shared memory

I am trying to implement a global reduction kernel in PTX which uses shared memory for reduction within a thread block (like all the CUDA C examples out there). In CUDA C one has the possibility to define a variable-length array in shared memory…
ritter
  • 7,447
  • 7
  • 51
  • 84
2
votes
0 answers

How to compare AT&T-assembly-like sources (e.g. CUDA PTX)?

I want to compare two pieces of low-level code, each in its own file. The format is AT&T-Assembly-style: For me, it's two CUDA PTX files, but this question applies also for the output of gcc -S or clang -S. For the sake of discussion assume that…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
2
votes
1 answer

Are load and store operations in shared memory atomic?

I'm trying to figure out whether load and store operations on primitive types are atomics when we load/store from shared memory in CUDA. On the one hand, it seems that any load/store is compiled to the PTX instruction ld.weak.shared.cta which does…
Pierre T.
  • 380
  • 1
  • 13
2
votes
0 answers

Get the PTX dump when running TensorRT

I am running an ONNX model through TensorRT. I can verify that inference is running on the GPU through the results and nvsys profile logs. However, I would like to see the corresponding PTX binary that TensorRT generates for my input model. Is there…
mikepapadim
  • 443
  • 2
  • 14
2
votes
1 answer

Understanding the parameters of PTX instruction mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32

How to understand the parameters in the following snippet of CUDA inline assembly code? ...... asm volatile( \ "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 \n" \ " {%0, %1, %2, %3}, \n" \ " …
inprocess
  • 31
  • 3
2
votes
1 answer

CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error

I'm using inline PTX ld.shared to load data from shared memory: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory float Csub = 0; As[TY][TX] = A[a + wA * TY + TX]; //load data from global memory to shared…
Yichen
  • 91
  • 1
  • 5