Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

nVIDIA's GPUs have differing micro-architectures, and the changes between them are not always incremental (such as the addition of instructions to an existing ISA by successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is roughly a parallel to SPIR, the OpenCL-standard-related representation. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred  %p<2>;
        .reg .b32   %r<8>;
        .reg .b64   %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the nVIDIA PTX Reference.

164 questions
0
votes
1 answer

Linking error when using NVIDIA's static PTX compiler library & -lpthreads

I'm linking a program using NVIDIA's PTX compiler library, with a link command generated by CMake, like so: usr/bin/c++ -O3 -DNDEBUG \ CMakeFiles/vectorAdd_ptx.dir/modified_cuda_samples/vectorAdd_ptx/vectorAdd_ptx.cpp.o \ -o bin/vectorAdd_ptx …
einpoklum
  • 118,144
  • 57
  • 340
  • 684
0
votes
1 answer

What does --entry take in CUDA's PTX JIT compiler?

NVIDIA's CUDA offers a PTX compilation library. One of the supported JIT compilation options for PTX code using the library is --entry entry,... (-e) which the documentation describes as: Specify the entry functions for which code must be…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
0
votes
1 answer

Is it bad that NVCC generates PTX code that is very generous with registers?

I recently read through the generated PTX code of a CUDA kernel. I realized that many registers are used to just store an intermediate value and are then never used again, and that NVCC generally seems to not care much about register re-use and…
0
votes
0 answers

Can nvcc generate an older PTX ISA version

When I compile (nvcc -ptx axpy) a short example kernel with nvcc in CUDA toolkit 11.4: __global__ void axpy(float a, float* x, float* y) { y[threadIdx.x] = a * x[threadIdx.x]; } I get this ptx: // // Generated by NVIDIA NVVM Compiler // //…
Steve Cox
  • 1,947
  • 13
  • 13
0
votes
1 answer

How does the CUDA compilation process take place?

According to NVIDIA's Programming Guide: Source files for CUDA applications consist of a mixture of conventional C++ host code, plus GPU device functions. The CUDA compilation trajectory separates the device functions from the host code, compiles…
enrique
  • 1
  • 3
0
votes
1 answer

Error when compile cuda with ptx instruction 'ldmatrix' and 'mma'

I got an error from the following code when I intended to use the ldmatrix and mma instructions. The PTX documentation says that 'ldmatrix' is introduced in PTX 6.5, so I suspect the PTX version could be one reason. I'd like to know how we find out which PTX…
0
votes
1 answer

Can I easily get vim to syntax-highlight CUDA PTX files?

VIM (8.1) has support for CUDA files, at least on Debian/Devuan, via vim81/syntax/cuda.vim . But - there's no such highlighting for PTX files. One idea I've had is trying to lift something from godbolt's codebase - as they do highlight their PTX…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
0
votes
1 answer

What do the %envregN special registers hold?

I've read: CUDA PTX code %envreg<32> special registers . The poster there was satisfied with not trying to treat OpenCL-originating PTX as a regular CUDA PTX. But - their question about %envN registers was not properly answered. Mark Harris wrote…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
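
For reference on the question above: the PTX ISA defines %envreg0 through %envreg31 as 32-bit, read-only, driver-defined special registers. Below is a minimal, speculative sketch (my own example, not from the question) of reading one of them via inline PTX; under the CUDA driver these registers are normally just zero, and the kernel name and the choice of mov.b32 are illustrative assumptions:

__global__ void read_envreg(unsigned int *out) {
    unsigned int value;
    // %envreg0..%envreg31 are 32-bit, read-only, driver-defined special registers;
    // under the CUDA driver (as opposed to an OpenCL driver) they are typically 0.
    asm volatile("mov.b32 %0, %%envreg0;" : "=r"(value));
    *out = value;
}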
0
votes
1 answer

Why am I getting multiple OpenCL 'binaries' when I built my program for one device?

I'm building an OpenCL program - using NVIDIA CUDA 11.2's OpenCL library (and its C++ bindings). After invoking cl::Program::build() successfully, for a single device (passing a vector with a single device index), I obtain the generated "binaries"…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
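
As context for the question above, here is a hedged sketch (my own example, using the plain OpenCL C API rather than the C++ bindings mentioned in the question) of retrieving the per-device "binaries" after a successful build; with an NVIDIA driver these are typically PTX text:

#include <CL/cl.h>
#include <vector>

std::vector<std::vector<unsigned char>> get_binaries(cl_program program) {
    cl_uint num_devices = 0;
    clGetProgramInfo(program, CL_PROGRAM_NUM_DEVICES,
                     sizeof(num_devices), &num_devices, nullptr);

    std::vector<size_t> sizes(num_devices);
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizes.size() * sizeof(size_t), sizes.data(), nullptr);

    // CL_PROGRAM_BINARIES expects an array of pointers, one per device,
    // each pointing at a buffer large enough for that device's binary.
    std::vector<std::vector<unsigned char>> binaries(num_devices);
    std::vector<unsigned char *> pointers(num_devices);
    for (cl_uint i = 0; i < num_devices; ++i) {
        binaries[i].resize(sizes[i]);
        pointers[i] = binaries[i].data();
    }
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     pointers.size() * sizeof(unsigned char *),
                     pointers.data(), nullptr);
    return binaries;
}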
0
votes
1 answer

What does %f, %rd mean in ptx assembly

Hi, I'm new to CUDA programming. I've got this piece of assembly code from building a program with OpenCL. I came to wonder what those numbers and characters mean. Such as %f7, %f11, %rd3, %r3, %f, %p. I'm guessing that rd probably refers to a…
Aesop
  • 151
  • 1
  • 7
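
Regarding the register-prefix question above: in PTX emitted by the NVVM compiler, virtual registers are grouped by type, and the prefix conventionally indicates the register class, e.g. %p for predicates, %r for 32-bit integers, %rd for 64-bit integers (including addresses), %f for 32-bit floats and %fd for 64-bit floats. A small sketch (my own example, not the asker's code) that exercises several of these classes:

// Compiling this with, e.g., nvcc -ptx typically yields %rd registers for the
// pointer arithmetic, %f registers for the float math, %r for the thread index,
// and %p for the predicate guarding the store (exact numbering varies by toolkit).
__global__ void scale(float *data, float factor, int length) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < length)
        data[i] = data[i] * factor;
}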
0
votes
1 answer

Why does addition without overflow set CC.CF to 1?

I have the following code #include #include #include __global__ void cuda_test() { int result; asm( ".reg .u32 r1;\n\t" "add.cc.u32 r1, 0, 0;\n\t" "subc.u32 %0, 0, 0; \n\t" :"=r"(result) …
chabapok
  • 921
  • 1
  • 8
  • 14
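
The question above revolves around the PTX condition-code carry chain (add.cc/addc, sub.cc/subc). As a point of comparison, here is a hedged sketch (my own example, not the asker's code) of the more common use of that chain: adding two 64-bit values held as 32-bit halves, with the carry propagating between the two instructions:

__device__ void add64(unsigned a_lo, unsigned a_hi,
                      unsigned b_lo, unsigned b_hi,
                      unsigned *r_lo, unsigned *r_hi) {
    unsigned lo, hi;
    // add.cc writes the carry-out into CC.CF; addc consumes it as carry-in.
    asm("add.cc.u32  %0, %2, %3;\n\t"
        "addc.u32    %1, %4, %5;\n\t"
        : "=r"(lo), "=r"(hi)
        : "r"(a_lo), "r"(b_lo), "r"(a_hi), "r"(b_hi));
    *r_lo = lo;
    *r_hi = hi;
}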
0
votes
1 answer

Why is this NVIDIA CUDA PTX not working as intended?

I have this code that tries to add two vectors using a handwritten PTX function: //kernel.cu #include #include int main() { CUdevice device; CUcontext context; CUmodule module; CUfunction function; char*…
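
For orientation on the question above, here is a minimal driver-API skeleton for launching a hand-written PTX kernel. It assumes cuInit and context creation have already happened elsewhere; the PTX string and the entry name "vec_add" are placeholders of mine, not the code from the question, and error handling is reduced to a single check:

#include <cuda.h>
#include <cstdio>

extern const char *ptx_source;  // assumed: a NUL-terminated PTX string containing a .visible .entry named vec_add

void launch_from_ptx(CUdeviceptr a, CUdeviceptr b, CUdeviceptr c, int n) {
    CUmodule module;
    CUfunction function;
    // Load the PTX image and look up the entry point by name.
    if (cuModuleLoadData(&module, ptx_source) != CUDA_SUCCESS ||
        cuModuleGetFunction(&function, module, "vec_add") != CUDA_SUCCESS) {
        fprintf(stderr, "failed to load PTX module or entry point\n");
        return;
    }
    void *args[] = { &a, &b, &c, &n };
    cuLaunchKernel(function,
                   (unsigned)((n + 255) / 256), 1, 1,  // grid dimensions
                   256, 1, 1,                          // block dimensions
                   0, nullptr,                         // dynamic shared memory, stream
                   args, nullptr);                     // kernel parameters, extra
    cuCtxSynchronize();
}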
0
votes
1 answer

CUDA device properties and compute capability when compiling

Let's assume I have a code which lets the user pass the threads_per_block to call the kernel. Then I want to check, if the input is valid (e.g. <=512 for compute capability CC <2.0 and 1024 for CC >=2.0). Now I wonder what would happen if I compile…
tim
  • 9,896
  • 20
  • 81
  • 137
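
A runtime check along the lines the question above describes can query the actual device limit instead of hard-coding one per compute capability. A short sketch (my own example, function name and printout included for illustration):

#include <cuda_runtime.h>
#include <cstdio>

bool block_size_is_valid(int threads_per_block, int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;
    // prop.major/prop.minor hold the compute capability; maxThreadsPerBlock is
    // the limit the question wants to validate against (1024 on CC >= 2.0).
    printf("CC %d.%d, max threads per block: %d\n",
           prop.major, prop.minor, prop.maxThreadsPerBlock);
    return threads_per_block <= prop.maxThreadsPerBlock;
}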
0
votes
1 answer

ptxas complains about (types in) my sad device function

Consider the following PTX code: // // Generated by NVIDIA NVVM Compiler... sort of // // Compiler Build ID: CL-25769353 // Cuda compilation tools, release 10.1, V10.1.105 // Based on LLVM 3.4svn // .version 6.4 .target sm_30 .address_size…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
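
As background for the question above: PTX's sad ("sum of absolute differences") instruction computes |a - b| + c, and all four operands must share the same integer type, which is exactly the kind of mismatch ptxas tends to reject. A minimal inline-PTX sketch (my own example, not the asker's function):

__device__ unsigned sad_u32(unsigned a, unsigned b, unsigned c) {
    unsigned d;
    // d = |a - b| + c, all operands .u32
    asm("sad.u32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;
}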
0
votes
1 answer

Simple add of vectors in Inline PTX CUDA

I am trying to make simple code which adds V1 (a vector) to V2 and saves the value in V3. It works fine in CUDA, but I can't write it in PTX; could someone help? __global__ void addKernelPTXv4(float4 *cc, const float4 *aa, const float4 *bb) { int i…
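
A hedged sketch of what such an element-wise addition with inline PTX can look like, loosely following the float4 signature shown in the excerpt (the body is my own, not the asker's code): each component is added with a single add.f32 instruction.

__global__ void add_float4_ptx(float4 *cc, const float4 *aa, const float4 *bb) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float4 a = aa[i], b = bb[i], c;
    // One add.f32 per component; "f" is the inline-PTX constraint for .f32 values.
    asm("add.f32 %0, %1, %2;" : "=f"(c.x) : "f"(a.x), "f"(b.x));
    asm("add.f32 %0, %1, %2;" : "=f"(c.y) : "f"(a.y), "f"(b.y));
    asm("add.f32 %0, %1, %2;" : "=f"(c.z) : "f"(a.z), "f"(b.z));
    asm("add.f32 %0, %1, %2;" : "=f"(c.w) : "f"(a.w), "f"(b.w));
    cc[i] = c;
}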