Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual-machine instruction set architecture used in NVIDIA's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (like the mere addition of instructions to an ISA with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is a rough parallel to SPIR, the OpenCL-standard-related representation. Continuing down the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.
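The stages of this toolchain can also be driven manually with the CUDA toolkit's command-line tools. A sketch (the file names here are hypothetical, and exact flags may vary between toolkit versions):

```shell
# Compile a CUDA source file to PTX (human-readable, microarchitecture-independent):
nvcc -ptx square.cu -o square.ptx

# Demangle the C++ symbol names that appear in the PTX:
c++filt < square.ptx

# Assemble the PTX into a binary for a concrete microarchitecture, e.g. sm_70:
ptxas -arch=sm_70 square.ptx -o square.cubin

# Disassemble the result to inspect the actual SASS:
cuobjdump -sass square.cubin
```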

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

        ret;
}

For more information on PTX in general, and on the specific instructions and data-access syntax in the example above, consult the NVIDIA PTX ISA Reference.

164 questions
7
votes
1 answer

How to create LLVM structure value?

I'm trying to create an LLVM value of a structure type. I'm using LLVM-C interface and find a function: LLVMValueRef LLVMConstStruct (LLVMValueRef *ConstantVals, unsigned Count, LLVMBool Packed) This works fine if all members are constant value…
Xiang Zhang
  • 2,831
  • 20
  • 40
7
votes
1 answer

PTX "bit bucket" registers

...are just mentioned in the PTX manual. There is no hint about what they are good for or how to use them. Does anyone know more? Am I just missing a common concept?
Dude
  • 583
  • 2
  • 9
6
votes
1 answer

CUDA: compilation of LLVM IR using NVPTX

For my project, I am generating PTX instructions for some functions in two different ways. The first method uses CUDA C to implement the functions and nvcc to compile them, using nvcc -ptx .cu -o .ptx. The other method writes code in…
PieterV
  • 816
  • 10
  • 23
6
votes
1 answer

CUDA/PTX 32-bit vs. 64-bit

CUDA compilers have options for producing 32-bit or 64-bit PTX. What is the difference between these? Is it like for x86, NVidia GPUs actually have 32-bit and 64-bit ISAs? Or is it related to host code only?
zlatanski
  • 815
  • 1
  • 8
  • 13
6
votes
1 answer

cuda - minimal example, high register usage

Consider these 3 trivial, minimal kernels. Their register usage is much higher than I expect. Why? A: __global__ void Kernel_A() { //empty } corresponding ptx: ptxas info : Compiling entry function '_Z8Kernel_Av' for 'sm_20' ptxas info :…
cmo
  • 3,762
  • 4
  • 36
  • 64
6
votes
2 answers

Passing the PTX program to the CUDA driver directly

The CUDA driver API provides loading the file containing PTX code from the filesystem. One usually does the following: CUmodule module; CUfunction function; const char* module_file = "my_prg.ptx"; const char* kernel_name = "vector_add"; err =…
ritter
  • 7,447
  • 7
  • 51
  • 84
5
votes
1 answer

NVPTX generic memory space location in architecture

In NVPTX(LLVM IR) for CUDA programs, there're identifiers for memory address space from 0 to 5 (see Table below). I saw in the same LLVM IR program, memory addresses are identified as 'Generic' or other types as shown in pictures. For 'Generic' (by…
cache
  • 1,239
  • 3
  • 13
  • 21
5
votes
1 answer

Portable way of accessing an array of bool (C++11) from Nvidia PTX

I need to do GPU computations on an boolean array bool[] (note, not a std::vector) which was created in CPU memory (with C++11 code) and then copied to the GPU via cuMemCpy or similar. First question: sizeof(bool) reports 1 byte. Is this…
ritter
  • 7,447
  • 7
  • 51
  • 84
4
votes
1 answer

The meaning of brackets around register in PTX assembly loads/stores

Below is an apparently legitimate PTX assembly code produced by Triton compiler. I'm puzzled by { %r1 } and { %r2 } used in load and store instructions. According to the PTX ISA documentation, it looks like an initializer list. But it does not make…
Dmitry Mikushin
  • 1,478
  • 15
  • 16
4
votes
1 answer

Ray triangle intersection intrinsic in CUDA and other OptiX components

Is there any way to directly use hardware accelerated ray triangle intersection in CUDA without using OptiX? This is analogous to how it is possible to use tensor cores directly in CUDA for small matrix multiplication, see the Programmatic Access to…
4
votes
1 answer

How to configure GCC for OpenMP 4.5 offloading to Nvidia PTX GPGPUs

With gcc 7.1 released, we can now configure gcc for openmp 4.5, offloading to Nvidia PTX GPGPUs. That's what they say in the release note (approximately). So my question is, is there any special flags to activate this configuration when compiling…
chedy najjar
  • 631
  • 7
  • 19
4
votes
1 answer

What is the difference between prefetch and prefetchu ptx instructions?

Here in the documentation, it is stated that prefetch and prefetchu ptx instructions "prefetch line containing a generic address at a specified level of memory hierarchy, in specified state space". It is also mentioned that the syntax is…
Farzad
  • 3,288
  • 2
  • 29
  • 53
4
votes
2 answers

compile constant memory array to immediate value in CUDA

I am writing a code to approximate a function using power series and would like to exploit #pragma unroll and FMA instruction, like this: __constant__ double coeff[5] = {1.0,2.0,3.0,4.0,5.0}; /* constant is fake here */ __device__ double…
wh0
  • 510
  • 1
  • 6
  • 19
4
votes
1 answer

Learning PTX from scratch

I'd like to start learning PTX, where should I start? Is there any good book/resource to do this? I already know x86/x64 ASM (more or less) if this might help
Marco A.
  • 43,032
  • 26
  • 132
  • 246
4
votes
1 answer

cuda: device function inlining and different .cu files

Two facts: CUDA 5.0 lets you compile CUDA code in different objects files for linking later on. CUDA architecture 2.x no longer inlines functions automatically. As usual in C/C++, I've implemented a function __device__ int foo() in functions.cu…
cmo
  • 3,762
  • 4
  • 36
  • 64