Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in NVIDIA's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (that is, not merely the addition of instructions with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is somewhat of a parallel to SPIR, the OpenCL-standard-related representation. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred          %p<2>;
        .reg .b32           %r<8>;
        .reg .b64           %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data-access syntax in the example above, consult the NVIDIA PTX ISA Reference.
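
PTX like the above can be produced directly by the CUDA compiler by stopping compilation at the PTX stage (the file name square.cu is only an assumption for this example):

nvcc -ptx square.cu -o square.ptx

The names in the generated file are C++-mangled (the kernel above appears as _Z6squarePii); the listing shown here has them demangled for readability.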

164 questions
1 vote • 2 answers

Is inline PTX more efficient than C/C++ code?

I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc). Is it more efficient to use them explicitly rather than write…
luxuia
  • 3,459
  • 1
  • 12
  • 8
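
A minimal sketch of the comparison the question asks about, using population count: the same operation written with the __popc() intrinsic and as inline PTX. The compiler typically emits the same machine instruction for both, so the intrinsic form is usually preferred; the function names here are made up for illustration.

// Population count via the CUDA intrinsic; the compiler maps this to popc.
__device__ unsigned popc_intrinsic(unsigned x) {
    return (unsigned)__popc(x);
}

// The same operation written explicitly as inline PTX, for comparison.
__device__ unsigned popc_inline_ptx(unsigned x) {
    unsigned r;
    asm("popc.b32 %0, %1;" : "=r"(r) : "r"(x));
    return r;
}
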
1 vote • 3 answers

type casting to unsigned long long in CUDA?

Basically what I want is a function that works like hiloint2uint64(): just join two 32-bit integers and reinterpret the outcome as a uint64. I cannot find any function in CUDA that can do this; anyhow, is there any PTX code that can do that kind of type…
user2188453
  • 1,105
  • 1
  • 12
  • 26
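
A sketch of the conversion being asked about (hiloint2uint64 is the asker's hypothetical name, not an actual CUDA intrinsic): two 32-bit halves can be joined either with plain shifts or with the PTX mov.b64 pack form, whose first brace operand becomes the low word.

#include <cstdint>

// Portable version: shift the high word up and OR in the low word.
__device__ uint64_t join_hi_lo(uint32_t hi, uint32_t lo) {
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Inline-PTX version: mov.b64 packs two 32-bit registers into a 64-bit one.
__device__ uint64_t join_hi_lo_ptx(uint32_t hi, uint32_t lo) {
    uint64_t r;
    asm("mov.b64 %0, {%1, %2};" : "=l"(r) : "r"(lo), "r"(hi));
    return r;
}
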
1 vote • 2 answers

Bug in PTX ISA (carry propagation)?

Is there a bug in CUDA? I have run the following code on my GTX 580 and r1 is zero at the end. I expect it to be one due to carry propagation. I have tested the code with CUDA Toolkit 4.2.9 and 5.5 and use "nvcc -arch=sm_20 bug.cu -o bug && ./bug"…
user4811
  • 183
  • 1
  • 11
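
For context, a sketch of carry propagation done in a way that is safe by construction: add.cc sets the carry flag and addc consumes it, but both must sit in a single asm() statement, because the condition-code register is not preserved across separate inline-assembly statements and the compiler may schedule other instructions between them.

#include <cstdint>

// 64-bit addition expressed as two 32-bit halves with an explicit carry.
__device__ void add64_as_2x32(uint32_t alo, uint32_t ahi,
                              uint32_t blo, uint32_t bhi,
                              uint32_t &rlo, uint32_t &rhi) {
    asm("add.cc.u32 %0, %2, %4;\n\t"  // low halves; sets the carry flag
        "addc.u32   %1, %3, %5;"      // high halves plus the carry-in
        : "=r"(rlo), "=r"(rhi)
        : "r"(alo), "r"(ahi), "r"(blo), "r"(bhi));
}
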
1 vote • 1 answer

LLVM NVPTX Backend with CUDA5 Dynamic Parallelism

Does LLVM's NVPTX backend (contributed by NVIDIA) have any support for the new Dynamic Parallelism feature found in CUDA5 / Compute Capability 3.5 devices?
n00b101
  • 265
  • 1
  • 10
1 vote • 1 answer

cuModuleLoadDataEx ignores all options

This question is similar to cuModuleLoadDataEx options but I would like to bring the topic up again and in addition provide more information. When loading a PTX string with the NV driver via cuModuleLoadDataEx it seems to ignore all options all…
ritter
  • 7,447
  • 7
  • 51
  • 84
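
For reference, a sketch of the call in question with JIT options attached, here requesting an info log from the JIT compiler (error handling omitted; the wrapper name is made up):

#include <cuda.h>
#include <cstddef>

// Load a PTX string with the driver API, passing two JIT options.
CUmodule load_ptx(const char *ptx) {
    char log[8192] = {0};
    CUjit_option opts[] = { CU_JIT_INFO_LOG_BUFFER,
                            CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES };
    void *vals[]        = { log, (void *)(size_t)sizeof(log) };
    CUmodule mod = nullptr;
    cuModuleLoadDataEx(&mod, ptx, 2, opts, vals);
    return mod;
}
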
1 vote • 1 answer

Overloading the CUDA shuffle function makes the original ones invisible

I'm trying to implement my own 64-bit shuffle function in CUDA. However, if I do it like this: static __inline__ __device__ double __shfl_xor(double var, int laneMask, int width=warpSize) { int hi, lo; asm volatile( "mov.b64 { %0, %1 }, %2;"…
Rainn
  • 315
  • 1
  • 9
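
The usual pattern behind such an overload, sketched here with the modern *_sync variants: split the double into two 32-bit halves, shuffle each half, and reassemble. Giving the wrapper a name other than __shfl_xor sidesteps the visibility problem the question describes.

// Shuffle a 64-bit double as two 32-bit halves.
__device__ double shfl_xor_double(double v, int laneMask,
                                  unsigned mask = 0xffffffffu) {
    int hi = __double2hiint(v);
    int lo = __double2loint(v);
    hi = __shfl_xor_sync(mask, hi, laneMask);
    lo = __shfl_xor_sync(mask, lo, laneMask);
    return __hiloint2double(hi, lo);
}
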
1 vote • 1 answer

how to find the active SMs?

Is there any way by which I can know the number of free/active SMs? Or at least to read the voltage/power or temperature values of each SM, by which I can know whether it is working or not? (in real time, while some job is getting executed on the GPU…
Rakesh Kumar
  • 51
  • 1
  • 7
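
PTX offers no view of which SMs are idle, but a kernel can at least report which SM it is running on via the %smid special register; a minimal sketch:

// Read the ID of the SM executing the calling thread.
__device__ unsigned current_sm_id() {
    unsigned sm;
    asm("mov.u32 %0, %%smid;" : "=r"(sm));
    return sm;
}
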
1 vote • 1 answer

Missing CUDA inline PTX constraint letter for 8 bit variables in order to disable L1 cache for 8 bit variable (bool)

INTRODUCTION: In this question we can learn how to disable the L1 cache for a single variable. Here is the accepted answer: As mentioned above you can use inline PTX; here is an example: __device__ __inline__ double ld_gbl_cg(const double *addr) { …
Sam
  • 27
  • 6
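
There is indeed no 8-bit constraint letter in CUDA's inline PTX. A sketch of the usual workaround, following the pattern the question cites: load the byte through a 16-bit register with the "h" constraint and let ld.global.cg.u8 zero-extend into it.

// Cache-global (cg) load of a single byte, returned as bool.
__device__ __inline__ bool ld_gbl_cg_u8(const bool *addr) {
    unsigned short t;
    asm("ld.global.cg.u8 %0, [%1];" : "=h"(t) : "l"(addr));
    return t != 0;
}
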
1 vote • 1 answer

CUDA __float_as_int in acosf implementation

CUDA C's math function implementation (cuda/math_functions.h) of acosf contains the passage: if (__float_as_int(a) < 0) { t1 = CUDART_PI_F - t1; } where a and t1 are floats and CUDART_PI_F is a float previously set to a numerical value close to…
ritter
  • 7,447
  • 7
  • 51
  • 84
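
As for what the test means: __float_as_int() reinterprets the float's bit pattern as a signed 32-bit integer, so the comparison is true exactly when the sign bit is set; a one-line sketch:

// True for negative values, including -0.0f (and NaNs with the sign bit set).
__device__ bool sign_bit_set(float a) {
    return __float_as_int(a) < 0;
}
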
1 vote • 1 answer

Syntax on inline PTX code for CUDA

As written in Nvidia's Inline PTX Assembly document, the grammar for using inline assembly is: asm("temp_string" : "constraint"(output) : "constraint"(input)); Here are two examples: asm("vadd.s32.s32.s32 %0, %1.h0, %2.h0;" : "=r"(v) : "r"(a),…
gpunerd
  • 151
  • 9
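
A complete minimal sketch of that grammar, with one output and two inputs: "=r" marks a 32-bit register that is written, "r" a 32-bit register that is read, and %0, %1, %2 number the operands with outputs first.

// 32-bit integer addition written as inline PTX.
__device__ int add_ptx(int a, int b) {
    int c;
    asm("add.s32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));
    return c;
}
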
0 votes • 0 answers

Using NVCC-generated PTX file in OpenCL

I'm experimenting with using an nvcc-generated PTX file in an OpenCL-based host program. A simple testing kernel is written in both CUDA C and OpenCL C, shown below. // exampleKernel_cuda.cu extern "C" { __global__ void exampleKernel(float *input,…
MCx
  • 1
0 votes • 1 answer

Confusion about __cvta_generic_to_shared

NVIDIA Ampere GPUs support a new feature: asynchronous copy from global memory to shared memory, bypassing L1 and the register file. The corresponding PTX instruction is cp.async. Why is __cvta_generic_to_shared needed to convert a shared memory pointer (T*) to size_t? size_t smem_a_converted =…
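
The conversion exists because cp.async takes its destination as an address in the .shared state space rather than a generic pointer. A sketch of the common pattern (a single 16-byte L1-bypassing copy; the cp.async.commit_group/wait_group synchronization is omitted):

// Asynchronously copy 16 bytes from global to shared memory (Ampere and later).
__device__ void copy16_async(void *smem_dst, const void *gmem_src) {
    unsigned smem_addr =
        static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;"
                 :: "r"(smem_addr), "l"(gmem_src));
}
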
0 votes • 0 answers

How can I run LLVM custom passes on the LLVM IR file of a CUDA code using opt?

I am having problems running LLVM custom passes on the ".ll" file of a CUDA program. For example, I have a CUDA sample code named sample1.cu. I have compiled the CUDA code using the following command. ./bin/clang++ -flegacy-pass-manager -g -Xclang…
Tauro
  • 59
  • 7
0 votes • 0 answers

Optix LNK2001 unresolved external symbol

I'm trying to follow a step-by-step tutorial on OptiX at GitHub, but in one file there's this line: extern "C" char embedded_ptx_code[]; which leads to the error in the title, since there's no definition of that C string. I don't get it, am I…
mrmanet
  • 3
  • 2
0 votes • 1 answer

How to get instruction cost in NVIDIA GPU?

I want to know how many clocks instructions such as add, mul, and ld/st cost on an NVIDIA GPU. How can I do that? I wrote some code to test it and ran it on a 2080 Ti: const int test_cnt = 1000; auto lastT = clock64(); uint32_t…
sorfkc
  • 13
  • 4
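
A sketch of the usual clock64() microbenchmarking pattern: time a long dependent chain so that instruction latency dominates, and keep the result observable so the compiler cannot delete the loop (kernel and parameter names are made up).

// Time a chain of dependent integer operations with clock64().
__global__ void time_add(long long *cycles, int *sink, int iters) {
    int x = threadIdx.x;
    long long t0 = clock64();
    for (int i = 0; i < iters; ++i)
        x += i ^ x;                   // each iteration depends on the last
    long long t1 = clock64();
    if (threadIdx.x == 0) {
        *sink = x;                    // keep the computation live
        *cycles = t1 - t0;
    }
}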