Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (such as the mere addition of new instructions in successive extensions). All of them, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation; in this respect PTX parallels SPIR, the OpenCL-standard-related intermediate representation. Continuing down the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.
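The stages of this toolchain can be driven directly from the command line. A minimal sketch, assuming the CUDA toolkit is installed and the kernel lives in a file named square.cu (the file name is illustrative):

```shell
# CUDA C++ -> PTX (virtual ISA)
nvcc -ptx -o square.ptx square.cu

# CUDA C++ -> cubin containing SASS for one concrete microarchitecture
nvcc -cubin -arch=sm_70 -o square.cubin square.cu

# Inspect the microarchitecture-specific SASS
cuobjdump --dump-sass square.cubin
```

At runtime, the CUDA driver can also JIT-compile embedded PTX into SASS for a GPU that did not exist when the binary was built, which is why PTX is the basis of forward compatibility.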

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

        ret;
}

For more information on PTX in general, and on the specific instructions and data access syntax in the example above, consult the NVIDIA PTX Reference.

164 questions
0
votes
1 answer

PTX - where are .reg registers located?

When I use .reg to declare registers.. where are they? I mean: if I use .reg inside a device function registers are stored on the register file that each thread has... but what if I declare a .reg variable in the module in the global scope (not…
Marco A.
0
votes
1 answer

PTX - difference between .local and .param

I'm studying PTX and I don't understand the difference between .param and .local state spaces. .local are variables visible to threads and stored on their stack (which is, by the way, thread memory) .param are variables used for object allocation…
Marco A.
0
votes
1 answer

PTX - get value/address

I don't understand how the mov instruction works in PTX.. mov.type d, a this moves a in d if a is a register or immediate value. By the way this can move into d the address of a if a is a variable in global, local or shared state space. Let's…
Marco A.
0
votes
1 answer

Using SIMD video instructions in inline ptx assembly CUDA

I want to use the SIMD video instructions (vadd4, vmax4 etc.) Section 8.7.13 in http://docs.nvidia.com/cuda/pdf/ptx_isa_3.1.pdf I tried the following in my code asm("vadd4.u32.u32.u32 %0, %1, %2, %3;" : "=r"(i) : "r"(j) : "r"(k) : "r"(l)); where…
0
votes
2 answers

"Unexpected address space" compilation error while using shared memory in PTX

I have written a trivial kernel in which I declare my shared memory array as extern __shared__ float As[100]; In my kernel launch I specify the number_of_bytes of shared memory. I get the error "Unexpected address space" while compiling the…
lucent
0
votes
1 answer

A Method of counting Floating Point Operations in a C++/CUDA Program using PTX

I have a somewhat large CUDA application and I need to calculate the attained GFLOPs. I'm looking for an easy and perhaps generic way of counting the number of floating point operations. Is it possible to count floating point operations from the…
0
votes
1 answer

Using textures in Cuda when kernel code is in PTX file and Host code also generates PTX file

I am having trouble getting texture read to work using Cuda [4.2] on Windows. My program reads a ptx file containing all the kernel modules. In addition the compilation process spits out an additional ptx file from a short routine of Host code. …
JPM
0
votes
1 answer

PTX to target translation in OpenCl

In OpenCL the "PTX (like bytecode in java) to target converter" is an interpreter (like we have for bytecode in java) or an ahead-of-time assembler?
gpuguy
0
votes
1 answer

PTX arrays as operands not working

The PTX manual (version 2.3) (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf) 6.4.2 states: Array elements can be accessed using an explicitly calculated byte address, or by indexing into the array using…
ritter
-1
votes
2 answers

Why does PTX show a 32 bit load operation for a 128 bit struct assignment?

I defined custom structs of 128 bits like this- typedef struct dtype{ int val; int temp2; int temp3; int temp4; }dtype; Then I performed an assignment :- dtype temp= h_a[i]; //where h_a is dtype * I was expecting a 128 bit load but instead PTX…
-1
votes
1 answer

PTX command for reading specific amount of bit at a time

I understand that this is achievable by using a loop with bit manipulation functions, however I wonder if there is faster and more direct way of doing this. I am using about 64 bits as a header for my data structure which may be many different…
-2
votes
2 answers

PyTorch CUDA : the provided PTX was compiled with an unsupported toolchain

I am using Nvidia V100 with the following specs: (pytorch) [s.1915438@cl1 aneurysm]$ srun nvidia-smi Sun Jul 17 16:17:27 2022 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 495.29.05 Driver…
Prakhar Sharma
-2
votes
1 answer

Why Pytorch 1.7 with cuda10.1 is not compatible with Nvidia A100 Ampere Architecture (according to PTX compatibility principle)

According to Nvidia official documentation, if a CUDA application is built to include PTX, because the PTX is forward-compatible, meaning PTX is supported to run on any GPU with compute capability higher than the compute capability assumed for…
-2
votes
1 answer

Raise x to power of y in ptx nvidia cuda (assembly)

I want to raise x to the power of y in ptx. Nvidia has a function ex2 which calculates 2^x and lg2 which calculates log2x but there's no function for calculating x^y. Is there more clever and simpler solution that multiplying value in loop? How…
karlosos