Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (i.e., not merely the addition of instructions with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is somewhat of a parallel to SPIR, the OpenCL-standard-related representation. Continuing down the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.
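As a sketch of that toolchain in practice (assuming the CUDA toolkit's `nvcc` and `cuobjdump` are on the PATH, and a source file named `square.cu`; the target architecture `sm_75` is an arbitrary choice for illustration):

```shell
# Compile CUDA C++ source to PTX (the virtual ISA):
nvcc -ptx square.cu -o square.ptx

# Or compile all the way to a device binary, then inspect the
# microarchitecture-specific SASS assembly it contains:
nvcc -cubin -arch=sm_75 square.cu -o square.cubin
cuobjdump --dump-sass square.cubin
```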

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred          %p<2>;
        .reg .b32           %r<8>;
        .reg .b64           %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data-access syntax in the example above, consult the NVIDIA PTX ISA Reference.
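A recurring theme in the questions under this tag is executing PTX directly. As a minimal sketch (error checking omitted, not a complete production program), a host application could load the PTX above at runtime via the CUDA Driver API; the file name `square.ptx` and the launch dimensions are assumptions for illustration, and the kernel must be looked up by its mangled name (here, the Itanium-ABI mangling of `square(int*, int)`):

```c
#include <cuda.h>   /* CUDA Driver API */
#include <stddef.h>

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* JIT-compiles the PTX for the present GPU and loads it */
    cuModuleLoad(&mod, "square.ptx");
    /* Kernel lookup uses the mangled name, not "square" */
    cuModuleGetFunction(&fn, mod, "_Z6squarePii");

    int length = 1024;
    CUdeviceptr d_array;
    cuMemAlloc(&d_array, length * sizeof(int));

    void *args[] = { &d_array, &length };
    cuLaunchKernel(fn,
                   (length + 255) / 256, 1, 1,  /* grid  */
                   256, 1, 1,                   /* block */
                   0, NULL, args, NULL);
    cuCtxSynchronize();

    cuMemFree(d_array);
    cuCtxDestroy(ctx);
    return 0;
}
```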

164 questions
2 votes, 1 answer

Converting CUDA .cu file to PTX file

I am having problem converting .cu to .ptx. I am using nvcc as follows: "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\nvcc" -ptx -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -o foo.ptx foo.cu The following…
Maverick
2 votes, 1 answer

load function parameters in inlined ptx

I have the following function with inline assembly that works fine on debug mode in 32 bit Visual Studio 2008: __device__ void add(int* pa, int* pb) { asm(".reg .u32 s<3>;"::); asm(".reg .u32 r<14>;"::); asm("ld.global.b32 s0,…
Meriko
2 votes, 2 answers

CUDA Expression Templates and Just in Time Compilation (JIT)

I have some questions about Just-In-Time (JIT) compilation with CUDA. I have implemented a library based on Expression Templates according to the paper J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade…
Vitality
1 vote, 1 answer

Can I hint to CUDA that it should move a given variable into the L1 cache?

Can I hint to CUDA that it should asynchronously move a given variable into the L1 cache? I have a deterministic data access pattern (crazy_access_order) that is unfortunately very ill-served by an LRU cache eviction policy, which I think is what…
emchristiansen
1 vote, 1 answer

Why does NVCC not optimize away ceilf() for literals?

(Followup question for Compile-time ceiling function, for literals, in C?) Considering the following CUDA function: __device__ int foo_f() { return ceilf(1007.1111); } It should be easy to optimize this to produce a device function which simply…
einpoklum
1 vote, 1 answer

Disable CUDA PTX-to-binary JIT compilation

Is there a way to disable the Just-In-Time compilation of PTX code to GPU assembly when running an application? There are certain scenarios where one want to run a GPU-enabled application on CPU-only mode. If the application is big enough, and it is…
648trindade
1 vote, 1 answer

Simple way to merge multiple source files into one fatbinary

To simplify the build process in a project, I'd like to compile multiple source files into device PTX code, and have all those modules in a single .fatbin file to be linked later. I can achieve this currently through either compiling each file…
brenocfg
1 vote, 1 answer

When should NVRTC compilation produce a CUBIN?

If I understand the workflow description in the NVRTC documentation correctly, here's how it works: Create an NVRTC program from the source text. Compile the NVRTC program to get PTX code. Device-link the PTX code using NVIDIA's Driver API…
einpoklum
1 vote, 1 answer

Can I combine a "static" CUDA kernel launch with PTX code and get a working binary?

Suppose I take a CUDA program - for example the CUDA vectorAdd sample, and cut out the kernel's implementation, but still have the launch command: vectorAdd<<>>(d_A, d_B, d_C, numElements); and suppose that I write…
einpoklum
1 vote, 1 answer

How can I create an executable to run a kernel in a given PTX file?

As far as I know, you need a host code (for CPU) and a device code (for GPU), without them you can't run something on GPU. I am learning PTX ISA and I don't know how to execute it on Windows. Do I need a .cu file to run it or is there another way to…
The Mayhem
1 vote, 1 answer

Is there a way to access value of constant memory bank in CUDA

I have been trying to debug cuda programs that use inline PTX assembly. Specifically, I am debugging at the instruction level, and am trying to determine the values of arguments to the instructions. Occasionally, the disassembly includes a…
Benjie
1 vote, 1 answer

Using nvdisasm to generate control flow image of PTX code

I have a single file of CUDA code compiled to intermediate language PTX code, example.ptx. I would be interested to start poking around with this short file, trying to understand how it works. I don't have previous experience fiddling around with…
Toothery
1 vote, 1 answer

Understanding Performance Behavior of Random Writes to Global Memory

I'm running experiments aiming to understand the behavior of random read and write access to global memory. The following kernel reads from an input vector (groupColumn) with a coalesced access pattern and reads random entries from a hash table in…
1 vote, 1 answer

When is the (default-variant) PTX instruction `prmt` useful?

PTX has a prmt instruction with many variants. This question regards the default one, which, if formatted as a C/C++ function, would look like this: uint32_t prmt(uint32_t a, uint32_t b, uint32_t byte_selectors); and this is what it does (adapted…
einpoklum
1 vote, 2 answers

How should I get CMake to also create PTX files for my kernels

I'm building a project with CUDA code, using a recent CMake which has intrinsic CUDA support (version >= 3.8 or later if necessary). How do I tell CMake to (also) generate PTX files for my various kernels? Something I've tried which doesn't (?)…
einpoklum