Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing micro-architectures, the changes between which are not always incremental (unlike, say, the addition of instructions to the x86 ISA with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is a rough parallel to SPIR, the OpenCL-standard-related representation. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;  // global thread index
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred  %p<2>;
        .reg .b32   %r<8>;
        .reg .b64   %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX ISA Reference. PTX like the above can be obtained by passing -ptx to nvcc (e.g. nvcc -ptx square.cu), with names demangled by a tool such as c++filt.
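As a small illustration of how PTX surfaces in CUDA source code, the special registers appearing above (%tid, %ntid, %ctaid) belong to the same family as %laneid, and can be read directly from CUDA C++ through inline PTX. A minimal sketch (the helper name is ours, not a CUDA API):

__device__ unsigned lane_id() {
    unsigned lane;
    asm volatile ("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;  // this thread's lane index within its warp, 0..31
}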

164 questions
1 vote, 1 answer

What is the correct way to support `__shfl()` and `__shfl_sync()` instructions?

From my understanding, CUDA 10.1 removed the shfl instructions: "PTX ISA version 6.4 removes the following features: Support for shfl and vote instructions without the .sync qualifier has been removed for .target sm_70 and higher." This support was…
Blizzard
  • 1,117
  • 2
  • 11
  • 28
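For the question above, the migration is mechanical: every removed intrinsic gains a .sync variant that takes an explicit mask of participating lanes. A minimal sketch, assuming all 32 lanes of the warp are active:

__global__ void broadcast_lane0(int *out) {
    int value = threadIdx.x;
    // Pre-CUDA-9:  int other = __shfl(value, 0);   // removed for sm_70+
    // CUDA 9+: name the participating lanes explicitly.
    int other = __shfl_sync(0xFFFFFFFF, value, 0);  // every lane reads lane 0
    out[threadIdx.x] = other;
}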
1 vote, 1 answer

How to interpret PTX function names

When I compile my CUDA file with nvcc -arch=sm_61 -std=c++11 -Xptxas -v,-warn-spills --use_fast_math -maxrregcount 128 nv_wavenet_perf.cu -o nv_wavenet_perf_dual, I get many lines of register-spill warnings: ptxas warning : Registers are spilled to…
CottonCandy
  • 444
  • 2
  • 5
  • 15
1 vote, 1 answer

How do I check for overflow of integer arithmetic in CUDA?

In CUDA, how can I determine whether my last integer arithmetic operation has overflowed/underflowed or not? Can I get the value of an overflow flag?
einpoklum
  • 118,144
  • 57
  • 340
  • 684
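CUDA C++ exposes no condition-code flag, but the carry is reachable from PTX. A minimal sketch for unsigned 32-bit addition (the helper name is hypothetical); for most purposes the portable check a + b < a is equivalent:

__device__ bool add_overflows_u32(unsigned a, unsigned b, unsigned &sum) {
    unsigned carry;
    asm ("add.cc.u32 %0, %2, %3;\n\t"  // add, writing the carry flag
         "addc.u32   %1, 0, 0;"        // materialize the flag as 0 or 1
         : "=r"(sum), "=r"(carry)
         : "r"(a), "r"(b));
    return carry != 0;
}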
1 vote, 3 answers

Warp shuffling for CUDA

I need to make a warp shuffle that looks like this: in this picture, the number of threads is limited to 8 to keep it readable. If I read the NVIDIA SDK and PTX manual, the shuffle instruction should do the job, especially the shfl.idx.b32 d[|p],…
Timocafé
  • 765
  • 6
  • 18
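The shfl.idx instruction mentioned in the excerpt is exposed in CUDA C++ as __shfl_sync: each lane names the source lane it wants to read. A minimal sketch of an arbitrary 8-wide permutation (the permutation table is made up for illustration):

__global__ void permute8(int *out) {
    const int src_lane[8] = {3, 7, 0, 4, 1, 5, 2, 6};  // assumed pattern
    int lane  = threadIdx.x % warpSize;
    int value = lane * 10;                             // toy per-lane payload
    // width = 8 restricts the shuffle to sub-groups of 8 lanes
    out[lane] = __shfl_sync(0xFFFFFFFF, value, src_lane[lane % 8], 8);
}

Launched as permute8<<<1, 8>>>(d_out), lane i receives the payload of lane src_lane[i].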
1 vote, 1 answer

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid? On the host side, you know how much shared memory a launched kernel had (or will have),…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
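PTX does expose this to device code: the special registers %dynamic_smem_size and %total_smem_size (PTX ISA 4.1 and later) hold the dynamic and the combined static-plus-dynamic shared memory sizes of the running kernel. A minimal sketch via inline PTX:

__device__ unsigned dynamic_shared_size() {
    unsigned size;
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(size));
    return size;  // bytes of dynamic shared memory per block
}

__device__ unsigned total_shared_size() {
    unsigned size;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(size));
    return size;  // static + dynamic shared memory, in bytes
}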
1 vote, 1 answer

Loading a PTX programmatically returns error 209 when run against a device with compute capability 5.0

I am trying to use the ptxjit sample from the CUDA SDK as the basis for instrumenting the interaction with the GPU device. I've managed to successfully compile the instrumentation code, and to control the device to load and execute a PTX module with a…
prmottajr
  • 1,816
  • 1
  • 13
  • 22
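Error 209 is CUDA_ERROR_NO_BINARY_FOR_GPU, which during PTX JIT-loading usually means the module's .target is newer than the device's compute capability (e.g. sm_52 PTX on a CC 5.0 device). A minimal driver-API sketch of the loading path the ptxjit sample exercises (link with -lcuda; the PTX string here is a trivial stand-in):

#include <cuda.h>
#include <cstdio>

const char *kernel_ptx =
    ".version 4.2\n.target sm_50\n.address_size 64\n"
    ".visible .entry noop() { ret; }\n";

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;
    CUresult rc = cuModuleLoadData(&mod, kernel_ptx);  // JIT-compiles the PTX
    if (rc != CUDA_SUCCESS) { printf("load failed: error %d\n", rc); return 1; }
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "noop");
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, 0, nullptr, nullptr);
    cuCtxSynchronize();
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}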
1 vote, 1 answer

c++filt not aggressive enough for some of the mangled names in PTX files

I'm filtering my compiled PTX through c++filt, but it only demangles some of the names/labels and leaves some as-is. For example, this: func (.param .b32 func_retval0) _ZN41_INTERNAL_19_gather_bits_cpp1_ii_56538e7c6__shflEiii( .param .b32…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
1 vote, 1 answer

In CUDA PTX, what does %warpid mean, really?

In CUDA PTX, there's a special register which holds a thread's warp's index: %warpid. Now, the spec says: Note that %warpid is volatile and returns the location of a thread at the moment when read, but its value may change during execution, e.g.,…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
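In practice the spec's warning means %warpid is a scheduling artifact (the warp's current slot on its SM), not a stable identifier; the dependable "warp index" is the one computed from threadIdx. A minimal sketch contrasting the two:

__device__ unsigned hw_warpid() {      // volatile: valid only at the moment read
    unsigned wid;
    asm volatile ("mov.u32 %0, %%warpid;" : "=r"(wid));
    return wid;
}

__device__ unsigned sw_warpid() {      // stable logical warp index in the block
    return threadIdx.x / warpSize;
}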
1 vote, 1 answer

CUDA PTX f32.f32 texture read

Is it possible to read from a CUDA texture using a floating-point index directly, e.g. can I perform a texture fetch using tex.1d.v4.f32.f32? This appears to save two instructions when looking at the .ptx files, and this is reflected in an increased…
ebarr
  • 7,704
  • 1
  • 29
  • 40
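At the CUDA C++ level the same float-indexed fetch is available through texture objects, and the compiler typically lowers it to exactly such a tex.1d.v4.f32.f32 instruction. A minimal sketch:

__global__ void fetch(cudaTextureObject_t tex, float x, float4 *out) {
    *out = tex1D<float4>(tex, x);  // float coordinate passed straight through
}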
1 vote, 0 answers

Process a CUDA Array as a single object

I have been pounding my head against a wall trying to solve this issue for about a month now and neither my C skills nor my google-fu has been strong enough to come up with a solution. One of my favorite side projects has been and continues to be…
1 vote, 1 answer

How can I get NVVM IR (LLVM IR) from a .cu file, and how can I compile NVVM IR to binary?

I have a CUDA C/C++ program for CUDA 7.5. As is known, the libNVVM Library is an optimizing compiler library that generates PTX from NVVM IR. I can get PTX by using: nvcc -ptx .cu -o .ptx But how can I get NVVM IR (LLVM IR) from…
Alex
  • 12,578
  • 15
  • 99
  • 195
1 vote, 1 answer

LLVM error when creating array

I want to add the following line to my LLVM code (as specified in the PTX backend documentation): %1 = internal addrspace(3) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ] However, I get the following error: expected instruction opcode @g =…
PieterV
  • 816
  • 10
  • 23
1 vote, 2 answers

Developing PTX instead of CUDA for optimization. Does it make sense?

I'm developing CUDA code. But new device languages with PTX or SPIR backends have been announced, and I have come across some applications being developed with them. At least I think we can say the PTX language is enough to develop something at…
grypp
  • 405
  • 2
  • 15
1 vote, 1 answer

Using an ellipsis in a CUDA device function

I am trying to port some C code to a CUDA kernel. The code I am porting makes prevalent use of ellipses. When I try to use an ellipsis in a device function like the one below, I get an error saying that an ellipsis is not allowed in device functions. __device__…
jim
  • 13
  • 3
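C-style variadic parameters are indeed rejected in __device__ functions, but C++ variadic templates are accepted and resolved at compile time, which is usually the workable port. A minimal sketch (C++17 fold expression, so compile with -std=c++17; with older toolchains such as the CUDA 7.x era, recursion over the parameter pack does the same job):

template <typename... Args>
__device__ int sum_all(Args... args) {
    return (... + args);  // fold expression, expanded at compile time
}

__global__ void demo(int *out) {
    *out = sum_all(1, 2, 3, 4);  // 10
}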
1 vote, 1 answer

LLVM NVPTX backend struct parameter zero size

I'm getting an obscure exception when loading the PTX assembly generated by LLVM's NVPTX backend. (I'm loading the PTX from ManagedCuda - http://managedcuda.codeplex.com/ ) ErrorNoBinaryForGPU: This indicates that there is no kernel image available…
user901037