Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in Nvidia's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (i.e., not merely the addition of instructions with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is somewhat of a parallel to SPIR, the OpenCL-standard-related representation. Continuing down the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.
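As a sketch of that toolchain in practice (assuming the CUDA toolkit's `nvcc` and `cuobjdump` are on the PATH, and a source file named `square.cu`; the target architecture `sm_75` is an arbitrary choice for illustration):

```shell
# Compile CUDA C++ source to PTX (the virtual ISA):
nvcc -ptx square.cu -o square.ptx

# Or compile all the way to a device binary, then inspect the
# microarchitecture-specific SASS assembly it contains:
nvcc -cubin -arch=sm_75 square.cu -o square.cubin
cuobjdump --dump-sass square.cubin
```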

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred          %p<2>;
        .reg .b32           %r<8>;
        .reg .b64           %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data-access syntax in the example above, consult the NVIDIA PTX ISA Reference.
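A recurring theme in the questions under this tag is executing PTX directly. As a minimal sketch (error checking omitted, not a complete production program), a host application could load the PTX above at runtime via the CUDA Driver API; the file name `square.ptx` and the launch dimensions are assumptions for illustration, and the kernel must be looked up by its mangled name (here, the Itanium-ABI mangling of `square(int*, int)`):

```c
#include <cuda.h>   /* CUDA Driver API */
#include <stddef.h>

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* JIT-compiles the PTX for the present GPU and loads it */
    cuModuleLoad(&mod, "square.ptx");
    /* Kernel lookup uses the mangled name, not "square" */
    cuModuleGetFunction(&fn, mod, "_Z6squarePii");

    int length = 1024;
    CUdeviceptr d_array;
    cuMemAlloc(&d_array, length * sizeof(int));

    void *args[] = { &d_array, &length };
    cuLaunchKernel(fn,
                   (length + 255) / 256, 1, 1,  /* grid  */
                   256, 1, 1,                   /* block */
                   0, NULL, args, NULL);
    cuCtxSynchronize();

    cuMemFree(d_array);
    cuCtxDestroy(ctx);
    return 0;
}
```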

164 questions
2 votes, 1 answer

Converting CUDA .cu file to PTX file

I am having problem converting .cu to .ptx. I am using nvcc as follows: "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\nvcc" -ptx -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -o foo.ptx foo.cu The following…
Maverick
2 votes, 1 answer

load function parameters in inlined ptx

I have the following function with inline assembly that works fine on debug mode in 32 bit Visual Studio 2008: __device__ void add(int* pa, int* pb) { asm(".reg .u32 s<3>;"::); asm(".reg .u32 r<14>;"::); asm("ld.global.b32 s0,…
Meriko
2 votes, 2 answers

CUDA Expression Templates and Just in Time Compilation (JIT)

I have some questions about Just-In-Time (JIT) compilation with CUDA. I have implemented a library based on Expression Templates according to the paper J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade…
Vitality
1 vote, 1 answer

Can I hint to CUDA that it should move a given variable into the L1 cache?

Can I hint to CUDA that it should asynchronously move a given variable into the L1 cache? I have a deterministic data access pattern (crazy_access_order) that is unfortunately very ill-served by an LRU cache eviction policy, which I think is what…
emchristiansen
1 vote, 1 answer

Why does NVCC not optimize away ceilf() for literals?

(Followup question for Compile-time ceiling function, for literals, in C?) Considering the following CUDA function: __device__ int foo_f() { return ceilf(1007.1111); } It should be easy to optimize this to produce a device function which simply…
einpoklum
1 vote, 1 answer

Disable CUDA PTX-to-binary JIT compilation

Is there a way to disable the Just-In-Time compilation of PTX code to GPU assembly when running an application? There are certain scenarios where one want to run a GPU-enabled application on CPU-only mode. If the application is big enough, and it is…
648trindade
1 vote, 1 answer

Simple way to merge multiple source files into one fatbinary

To simplify the build process in a project, I'd like to compile multiple source files into device PTX code, and have all those modules in a single .fatbin file to be linked later. I can achieve this currently through either compiling each file…
brenocfg
1 vote, 1 answer

When should NVRTC compilation produce a CUBIN?

If I understand the workflow description in the NVRTC documentation correctly, here's how it works: Create an NVRTC program from the source text. Compile the NVRTC program to get PTX code. Device-link the PTX code using NVIDIA's Driver API…
einpoklum
1 vote, 1 answer

Can I combine a "static" CUDA kernel launch with PTX code and get a working binary?

Suppose I take a CUDA program - for example the CUDA vectorAdd sample, and cut out the kernel's implementation, but still have the launch command: vectorAdd<<>>(d_A, d_B, d_C, numElements); and suppose that I write…
einpoklum
1 vote, 1 answer

How can I create an executable to run a kernel in a given PTX file?

As far as I know, you need a host code (for CPU) and a device code (for GPU), without them you can't run something on GPU. I am learning PTX ISA and I don't know how to execute it on Windows. Do I need a .cu file to run it or is there another way to…
The Mayhem
1 vote, 1 answer

Is there a way to access value of constant memory bank in CUDA

I have been trying to debug cuda programs that use inline PTX assembly. Specifically, I am debugging at the instruction level, and am trying to determine the values of arguments to the instructions. Occasionally, the disassembly includes a…
Benjie
1 vote, 1 answer

Using nvdisasm to generate control flow image of PTX code

I have a single file of CUDA code compiled to intermediate language PTX code, example.ptx. I would be interested to start poking around with this short file, trying to understand how it works. I don't have previous experience fiddling around with…
Toothery
1 vote, 1 answer

Understanding Performance Behavior of Random Writes to Global Memory

I'm running experiments aiming to understand the behavior of random read and write access to global memory. The following kernel reads from an input vector (groupColumn) with a coalesced access pattern and reads random entries from a hash table in…
1 vote, 1 answer

When is the (default-variant) PTX instruction `prmt` useful?

PTX has a prmt instruction with many variants. This question regards the default one, which, if formatted as a C/C++ function, would look like this: uint32_t prmt(uint32_t a, uint32_t b, uint32_t byte_selectors); and this is what it does (adapted…
einpoklum
1 vote, 2 answers

How should I get CMake to also create PTX files for my kernels

I'm building a project with CUDA code, using a recent CMake which has intrinsic CUDA support (version >= 3.8 or later if necessary). How do I tell CMake to (also) generate PTX files for my various kernels? Something I've tried which doesn't (?)…
einpoklum