Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing micro-architectures, the changes between which are not always incremental (unlike, say, the addition of instructions to the x86 ISA with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is a rough parallel to SPIR, the OpenCL-standard-related representation. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;  // global thread index
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred  %p<2>;
        .reg .b32   %r<8>;
        .reg .b64   %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX ISA Reference. PTX like the above can be obtained by passing -ptx to nvcc (e.g. nvcc -ptx square.cu), with names demangled by a tool such as c++filt.
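As a small illustration of how PTX surfaces in CUDA source code, the special registers appearing above (%tid, %ntid, %ctaid) belong to the same family as %laneid, and can be read directly from CUDA C++ through inline PTX. A minimal sketch (the helper name is ours, not a CUDA API):

__device__ unsigned lane_id() {
    unsigned lane;
    asm volatile ("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;  // this thread's lane index within its warp, 0..31
}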

164 questions
1 vote, 1 answer

What is the correct way to support `__shfl()` and `__shfl_sync()` instructions?

From my understanding, CUDA 10.1 removed the shfl instructions: "PTX ISA version 6.4 removes the following features: Support for shfl and vote instructions without the .sync qualifier has been removed for .target sm_70 and higher." This support was…
Blizzard
  • 1,117
  • 2
  • 11
  • 28
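For the question above, the migration is mechanical: every removed intrinsic gains a .sync variant that takes an explicit mask of participating lanes. A minimal sketch, assuming all 32 lanes of the warp are active:

__global__ void broadcast_lane0(int *out) {
    int value = threadIdx.x;
    // Pre-CUDA-9:  int other = __shfl(value, 0);   // removed for sm_70+
    // CUDA 9+: name the participating lanes explicitly.
    int other = __shfl_sync(0xFFFFFFFF, value, 0);  // every lane reads lane 0
    out[threadIdx.x] = other;
}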
1 vote, 1 answer

How to interpret PTX function names

When I compile my CUDA file with nvcc -arch=sm_61 -std=c++11 -Xptxas -v,-warn-spills --use_fast_math -maxrregcount 128 nv_wavenet_perf.cu -o nv_wavenet_perf_dual, I get many lines of register-spill warnings: ptxas warning : Registers are spilled to…
CottonCandy
  • 444
  • 2
  • 5
  • 15
1 vote, 1 answer

How do I check for overflow of integer arithmetic in CUDA?

In CUDA, how can I determine whether my last integer arithmetic operation has overflowed/underflowed or not? Can I get the value of an overflow flag?
einpoklum
  • 118,144
  • 57
  • 340
  • 684
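CUDA C++ exposes no condition-code flag, but the carry is reachable from PTX. A minimal sketch for unsigned 32-bit addition (the helper name is hypothetical); for most purposes the portable check a + b < a is equivalent:

__device__ bool add_overflows_u32(unsigned a, unsigned b, unsigned &sum) {
    unsigned carry;
    asm ("add.cc.u32 %0, %2, %3;\n\t"  // add, writing the carry flag
         "addc.u32   %1, 0, 0;"        // materialize the flag as 0 or 1
         : "=r"(sum), "=r"(carry)
         : "r"(a), "r"(b));
    return carry != 0;
}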
1 vote, 3 answers

Warp shuffling for CUDA

I need to make a warp shuffle that looks like this: in this picture, the number of threads is limited to 8 to keep it readable. If I read the NVIDIA SDK and PTX manual, the shuffle instruction should do the job, especially the shfl.idx.b32 d[|p],…
Timocafé
  • 765
  • 6
  • 18
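The shfl.idx instruction mentioned in the excerpt is exposed in CUDA C++ as __shfl_sync: each lane names the source lane it wants to read. A minimal sketch of an arbitrary 8-wide permutation (the permutation table is made up for illustration):

__global__ void permute8(int *out) {
    const int src_lane[8] = {3, 7, 0, 4, 1, 5, 2, 6};  // assumed pattern
    int lane  = threadIdx.x % warpSize;
    int value = lane * 10;                             // toy per-lane payload
    // width = 8 restricts the shuffle to sub-groups of 8 lanes
    out[lane] = __shfl_sync(0xFFFFFFFF, value, src_lane[lane % 8], 8);
}

Launched as permute8<<<1, 8>>>(d_out), lane i receives the payload of lane src_lane[i].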
1 vote, 1 answer

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid? On the host side, you know how much shared memory a launched kernel had (or will have),…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
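PTX does expose this to device code: the special registers %dynamic_smem_size and %total_smem_size (PTX ISA 4.1 and later) hold the dynamic and the combined static-plus-dynamic shared memory sizes of the running kernel. A minimal sketch via inline PTX:

__device__ unsigned dynamic_shared_size() {
    unsigned size;
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(size));
    return size;  // bytes of dynamic shared memory per block
}

__device__ unsigned total_shared_size() {
    unsigned size;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(size));
    return size;  // static + dynamic shared memory, in bytes
}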
1 vote, 1 answer

Loading a PTX programmatically returns error 209 when run against a device with compute capability 5.0

I am trying to use the ptxjit sample from the CUDA SDK as the basis for instrumenting the interaction with the GPU device. I've managed to successfully compile the instrumentation code, and to control the device to load and execute a PTX module with a…
prmottajr
  • 1,816
  • 1
  • 13
  • 22
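Error 209 is CUDA_ERROR_NO_BINARY_FOR_GPU, which during PTX JIT-loading usually means the module's .target is newer than the device's compute capability (e.g. sm_52 PTX on a CC 5.0 device). A minimal driver-API sketch of the loading path the ptxjit sample exercises (link with -lcuda; the PTX string here is a trivial stand-in):

#include <cuda.h>
#include <cstdio>

const char *kernel_ptx =
    ".version 4.2\n.target sm_50\n.address_size 64\n"
    ".visible .entry noop() { ret; }\n";

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;
    CUresult rc = cuModuleLoadData(&mod, kernel_ptx);  // JIT-compiles the PTX
    if (rc != CUDA_SUCCESS) { printf("load failed: error %d\n", rc); return 1; }
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "noop");
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, 0, nullptr, nullptr);
    cuCtxSynchronize();
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}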
1 vote, 1 answer

c++filt not aggressive enough for some of the mangled names in PTX files

I'm filtering my compiled PTX through c++filt, but it only demangles some of the names/labels and leaves some as-is. For example, this: func (.param .b32 func_retval0) _ZN41_INTERNAL_19_gather_bits_cpp1_ii_56538e7c6__shflEiii( .param .b32…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
1 vote, 1 answer

In CUDA PTX, what does %warpid mean, really?

In CUDA PTX, there's a special register which holds a thread's warp's index: %warpid. Now, the spec says: Note that %warpid is volatile and returns the location of a thread at the moment when read, but its value may change during execution, e.g.,…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
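In practice the spec's warning means %warpid is a scheduling artifact (the warp's current slot on its SM), not a stable identifier; the dependable "warp index" is the one computed from threadIdx. A minimal sketch contrasting the two:

__device__ unsigned hw_warpid() {      // volatile: valid only at the moment read
    unsigned wid;
    asm volatile ("mov.u32 %0, %%warpid;" : "=r"(wid));
    return wid;
}

__device__ unsigned sw_warpid() {      // stable logical warp index in the block
    return threadIdx.x / warpSize;
}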
1 vote, 1 answer

CUDA PTX f32.f32 texture read

Is it possible to read from a CUDA texture using a floating-point index directly, e.g. can I perform a texture fetch using tex.1d.v4.f32.f32? This appears to save two instructions when looking at the .ptx files, and this is reflected in an increased…
ebarr
  • 7,704
  • 1
  • 29
  • 40
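At the CUDA C++ level the same float-indexed fetch is available through texture objects, and the compiler typically lowers it to exactly such a tex.1d.v4.f32.f32 instruction. A minimal sketch:

__global__ void fetch(cudaTextureObject_t tex, float x, float4 *out) {
    *out = tex1D<float4>(tex, x);  // float coordinate passed straight through
}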
1 vote, 0 answers

Process a CUDA Array as a single object

I have been pounding my head against a wall trying to solve this issue for about a month now and neither my C skills nor my google-fu has been strong enough to come up with a solution. One of my favorite side projects has been and continues to be…
1 vote, 1 answer

How can I get NVVM IR (LLVM IR) from a .cu file, and how can I compile NVVM IR to binary?

I have a CUDA C/C++ program for CUDA 7.5. As is known, the libNVVM Library is an optimizing compiler library that generates PTX from NVVM IR. I can get PTX by using: nvcc -ptx .cu -o .ptx But how can I get NVVM IR (LLVM IR) from…
Alex
  • 12,578
  • 15
  • 99
  • 195
1 vote, 1 answer

LLVM error when creating array

I want to add the following line to my LLVM code (as specified in the PTX backend documentation): %1 = internal addrspace(3) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ] However, I get the following error: expected instruction opcode @g =…
PieterV
  • 816
  • 10
  • 23
1 vote, 2 answers

Developing PTX instead of CUDA for optimization. Does it make sense?

I'm developing CUDA code. But new device languages with PTX or SPIR backends have been announced, and I have come across some applications being developed with them. At least I think we can say the PTX language is enough to develop something at…
grypp
  • 405
  • 2
  • 15
1 vote, 1 answer

Using an ellipsis in a CUDA device function

I am trying to port some C code to a CUDA kernel. The code I am porting makes prevalent use of ellipses. When I try to use an ellipsis in a device function like the one below, I get an error saying that an ellipsis is not allowed in device functions. __device__…
jim
  • 13
  • 3
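C-style variadic parameters are indeed rejected in __device__ functions, but C++ variadic templates are accepted and resolved at compile time, which is usually the workable port. A minimal sketch (C++17 fold expression, so compile with -std=c++17; with older toolchains such as the CUDA 7.x era, recursion over the parameter pack does the same job):

template <typename... Args>
__device__ int sum_all(Args... args) {
    return (... + args);  // fold expression, expanded at compile time
}

__global__ void demo(int *out) {
    *out = sum_all(1, 2, 3, 4);  // 10
}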
1 vote, 1 answer

LLVM NVPTX backend struct parameter zero size

I'm getting an obscure exception when loading the PTX assembly generated by LLVM's NVPTX backend. (I'm loading the PTX from ManagedCuda - http://managedcuda.codeplex.com/ ) ErrorNoBinaryForGPU: This indicates that there is no kernel image available…
user901037