Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual-machine instruction set architecture used in NVIDIA's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (like the mere addition of instructions to an ISA with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is a rough parallel to SPIR, the OpenCL-standard-related representation. Continuing down the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.
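The stages of this toolchain can also be driven manually with the CUDA toolkit's command-line tools. A sketch (the file names here are hypothetical, and exact flags may vary between toolkit versions):

```shell
# Compile a CUDA source file to PTX (human-readable, microarchitecture-independent):
nvcc -ptx square.cu -o square.ptx

# Demangle the C++ symbol names that appear in the PTX:
c++filt < square.ptx

# Assemble the PTX into a binary for a concrete microarchitecture, e.g. sm_70:
ptxas -arch=sm_70 square.ptx -o square.cubin

# Disassemble the result to inspect the actual SASS:
cuobjdump -sass square.cubin
```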

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

        ret;
}

For more information on PTX in general, and on the specific instructions and data-access syntax in the example above, consult the NVIDIA PTX ISA Reference.

164 questions
7
votes
1 answer

How to create LLVM structure value?

I'm trying to create an LLVM value of a structure type. I'm using LLVM-C interface and find a function: LLVMValueRef LLVMConstStruct (LLVMValueRef *ConstantVals, unsigned Count, LLVMBool Packed) This works fine if all members are constant value…
Xiang Zhang
  • 2,831
  • 20
  • 40
7
votes
1 answer

PTX "bit bucket" registers

...are just mentioned in the PTX manual. There is no hint about what they are good for or how to use them. Does anyone know more? Am I just missing a common concept?
Dude
  • 583
  • 2
  • 9
6
votes
1 answer

CUDA: compilation of LLVM IR using NVPTX

For my project, I am generating PTX instructions for some functions in two different ways. The first method uses CUDA C to implement the functions and nvcc to compile them, using nvcc -ptx .cu -o .ptx. The other method writes code in…
PieterV
  • 816
  • 10
  • 23
6
votes
1 answer

CUDA/PTX 32-bit vs. 64-bit

CUDA compilers have options for producing 32-bit or 64-bit PTX. What is the difference between these? Is it like for x86, NVidia GPUs actually have 32-bit and 64-bit ISAs? Or is it related to host code only?
zlatanski
  • 815
  • 1
  • 8
  • 13
6
votes
1 answer

cuda - minimal example, high register usage

Consider these 3 trivial, minimal kernels. Their register usage is much higher than I expect. Why? A: __global__ void Kernel_A() { //empty } corresponding ptx: ptxas info : Compiling entry function '_Z8Kernel_Av' for 'sm_20' ptxas info :…
cmo
  • 3,762
  • 4
  • 36
  • 64
6
votes
2 answers

Passing the PTX program to the CUDA driver directly

The CUDA driver API provides loading the file containing PTX code from the filesystem. One usually does the following: CUmodule module; CUfunction function; const char* module_file = "my_prg.ptx"; const char* kernel_name = "vector_add"; err =…
ritter
  • 7,447
  • 7
  • 51
  • 84
5
votes
1 answer

NVPTX generic memory space location in architecture

In NVPTX(LLVM IR) for CUDA programs, there're identifiers for memory address space from 0 to 5 (see Table below). I saw in the same LLVM IR program, memory addresses are identified as 'Generic' or other types as shown in pictures. For 'Generic' (by…
cache
  • 1,239
  • 3
  • 13
  • 21
5
votes
1 answer

Portable way of accessing an array of bool (C++11) from Nvidia PTX

I need to do GPU computations on an boolean array bool[] (note, not a std::vector) which was created in CPU memory (with C++11 code) and then copied to the GPU via cuMemCpy or similar. First question: sizeof(bool) reports 1 byte. Is this…
ritter
  • 7,447
  • 7
  • 51
  • 84
4
votes
1 answer

The meaning of brackets around register in PTX assembly loads/stores

Below is an apparently legitimate PTX assembly code produced by Triton compiler. I'm puzzled by { %r1 } and { %r2 } used in load and store instructions. According to the PTX ISA documentation, it looks like an initializer list. But it does not make…
Dmitry Mikushin
  • 1,478
  • 15
  • 16
4
votes
1 answer

Ray triangle intersection intrinsic in CUDA and other OptiX components

Is there any way to directly use hardware accelerated ray triangle intersection in CUDA without using OptiX? This is analogous to how it is possible to use tensor cores directly in CUDA for small matrix multiplication, see the Programmatic Access to…
4
votes
1 answer

How to configure GCC for OpenMP 4.5 offloading to Nvidia PTX GPGPUs

With gcc 7.1 released, we can now configure gcc for openmp 4.5, offloading to Nvidia PTX GPGPUs. That's what they say in the release note (approximately). So my question is, is there any special flags to activate this configuration when compiling…
chedy najjar
  • 631
  • 7
  • 19
4
votes
1 answer

What is the difference between prefetch and prefetchu ptx instructions?

Here in the documentation, it is stated that prefetch and prefetchu ptx instructions "prefetch line containing a generic address at a specified level of memory hierarchy, in specified state space". It is also mentioned that the syntax is…
Farzad
  • 3,288
  • 2
  • 29
  • 53
4
votes
2 answers

compile constant memory array to immediate value in CUDA

I am writing a code to approximate a function using power series and would like to exploit #pragma unroll and FMA instruction, like this: __constant__ double coeff[5] = {1.0,2.0,3.0,4.0,5.0}; /* constant is fake here */ __device__ double…
wh0
  • 510
  • 1
  • 6
  • 19
4
votes
1 answer

Learning PTX from scratch

I'd like to start learning PTX, where should I start? Is there any good book/resource to do this? I already know x86/x64 ASM (more or less) if this might help
Marco A.
  • 43,032
  • 26
  • 132
  • 246
4
votes
1 answer

cuda: device function inlining and different .cu files

Two facts: CUDA 5.0 lets you compile CUDA code in different objects files for linking later on. CUDA architecture 2.x no longer inlines functions automatically. As usual in C/C++, I've implemented a function __device__ int foo() in functions.cu…
cmo
  • 3,762
  • 4
  • 36
  • 64