Questions tagged [ptx]

Parallel Thread Execution (PTX) is a virtual machine instruction set architecture used in NVIDIA's CUDA programming environment.

NVIDIA's GPUs have differing microarchitectures, and the changes between them are not always incremental (that is, not merely the addition of instructions with successive extensions). They all, however, share an intermediate (virtual) instruction set, somewhat similar to a compiler's intermediate representation. Specifically, it is somewhat of a parallel to SPIR, the OpenCL-standard-related representation. Continuing the compilation toolchain, PTX is further compiled into one of several GPU-microarchitecture-specific assembly languages (SASS) for actual execution.

Here is an example of a simple CUDA kernel and the PTX resulting from its compilation:

__global__ void square(int *array, int length) {
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    if (pos < length)
        array[pos] = array[pos] * array[pos];
}

Resulting PTX (after name demangling):

.visible .entry square(int*, int)(
        .param .u64 square(int*, int)_param_0,
        .param .u32 square(int*, int)_param_1
)
{
        .reg .pred          %p<2>;
        .reg .b32           %r<8>;
        .reg .b64           %rd<5>;

        ld.param.u64        %rd1, [square(int*, int)_param_0];
        ld.param.u32        %r2, [square(int*, int)_param_1];
        mov.u32             %r3, %tid.x;
        mov.u32             %r4, %ntid.x;
        mov.u32             %r5, %ctaid.x;
        mad.lo.s32          %r1, %r4, %r5, %r3;
        setp.ge.s32         %p1, %r1, %r2;
        @%p1 bra            BB0_2;

        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32        %rd3, %r1, 4;
        add.s64             %rd4, %rd2, %rd3;
        ld.global.u32       %r6, [%rd4];
        mul.lo.s32          %r7, %r6, %r6;
        st.global.u32       [%rd4], %r7;

BB0_2:
        ret;
}

For more information on PTX in general, and on the specific instructions and data-access syntax in the example above, consult the NVIDIA PTX ISA Reference.
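
PTX like the above can be produced directly by the CUDA compiler by stopping compilation at the PTX stage (the file name square.cu is only an assumption for this example):

nvcc -ptx square.cu -o square.ptx

The names in the generated file are C++-mangled (the kernel above appears as _Z6squarePii); the listing shown here has them demangled for readability.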

164 questions
1 vote • 2 answers

Is inline PTX more efficient than C/C++ code?

I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc). Is it more efficient to use them explicitly rather than write…
luxuia
  • 3,459
  • 1
  • 12
  • 8
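
A minimal sketch of the comparison the question asks about, using population count: the same operation written with the __popc() intrinsic and as inline PTX. The compiler typically emits the same machine instruction for both, so the intrinsic form is usually preferred; the function names here are made up for illustration.

// Population count via the CUDA intrinsic; the compiler maps this to popc.
__device__ unsigned popc_intrinsic(unsigned x) {
    return (unsigned)__popc(x);
}

// The same operation written explicitly as inline PTX, for comparison.
__device__ unsigned popc_inline_ptx(unsigned x) {
    unsigned r;
    asm("popc.b32 %0, %1;" : "=r"(r) : "r"(x));
    return r;
}
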
1 vote • 3 answers

type casting to unsigned long long in CUDA?

Basically what I want is a function that works like hiloint2uint64(): just join two 32-bit integers and reinterpret the outcome as a uint64. I cannot find any function in CUDA that can do this; anyhow, is there any PTX code that can do that kind of type…
user2188453
  • 1,105
  • 1
  • 12
  • 26
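
A sketch of the conversion being asked about (hiloint2uint64 is the asker's hypothetical name, not an actual CUDA intrinsic): two 32-bit halves can be joined either with plain shifts or with the PTX mov.b64 pack form, whose first brace operand becomes the low word.

#include <cstdint>

// Portable version: shift the high word up and OR in the low word.
__device__ uint64_t join_hi_lo(uint32_t hi, uint32_t lo) {
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Inline-PTX version: mov.b64 packs two 32-bit registers into a 64-bit one.
__device__ uint64_t join_hi_lo_ptx(uint32_t hi, uint32_t lo) {
    uint64_t r;
    asm("mov.b64 %0, {%1, %2};" : "=l"(r) : "r"(lo), "r"(hi));
    return r;
}
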
1 vote • 2 answers

Bug in PTX ISA (carry propagation)?

Is there a bug in CUDA? I have run the following code on my GTX 580 and r1 is zero at the end. I expect it to be one due to carry propagation. I have tested the code with CUDA Toolkit 4.2.9 and 5.5 and use "nvcc -arch=sm_20 bug.cu -o bug && ./bug"…
user4811
  • 183
  • 1
  • 11
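
For context, a sketch of carry propagation done in a way that is safe by construction: add.cc sets the carry flag and addc consumes it, but both must sit in a single asm() statement, because the condition-code register is not preserved across separate inline-assembly statements and the compiler may schedule other instructions between them.

#include <cstdint>

// 64-bit addition expressed as two 32-bit halves with an explicit carry.
__device__ void add64_as_2x32(uint32_t alo, uint32_t ahi,
                              uint32_t blo, uint32_t bhi,
                              uint32_t &rlo, uint32_t &rhi) {
    asm("add.cc.u32 %0, %2, %4;\n\t"  // low halves; sets the carry flag
        "addc.u32   %1, %3, %5;"      // high halves plus the carry-in
        : "=r"(rlo), "=r"(rhi)
        : "r"(alo), "r"(ahi), "r"(blo), "r"(bhi));
}
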
1 vote • 1 answer

LLVM NVPTX Backend with CUDA5 Dynamic Parallelism

Does LLVM's NVPTX backend (contributed by NVIDIA) have any support for the new Dynamic Parallelism feature found in CUDA5 / Compute Capability 3.5 devices?
n00b101
  • 265
  • 1
  • 10
1 vote • 1 answer

cuModuleLoadDataEx ignores all options

This question is similar to cuModuleLoadDataEx options but I would like to bring the topic up again and in addition provide more information. When loading a PTX string with the NV driver via cuModuleLoadDataEx it seems to ignore all options all…
ritter
  • 7,447
  • 7
  • 51
  • 84
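
For reference, a sketch of the call in question with JIT options attached, here requesting an info log from the JIT compiler (error handling omitted; the wrapper name is made up):

#include <cuda.h>
#include <cstddef>

// Load a PTX string with the driver API, passing two JIT options.
CUmodule load_ptx(const char *ptx) {
    char log[8192] = {0};
    CUjit_option opts[] = { CU_JIT_INFO_LOG_BUFFER,
                            CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES };
    void *vals[]        = { log, (void *)(size_t)sizeof(log) };
    CUmodule mod = nullptr;
    cuModuleLoadDataEx(&mod, ptx, 2, opts, vals);
    return mod;
}
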
1 vote • 1 answer

Overloading the CUDA shuffle function makes the original ones invisible

I'm trying to implement my own 64-bit shuffle function in CUDA. However, if I do it like this: static __inline__ __device__ double __shfl_xor(double var, int laneMask, int width=warpSize) { int hi, lo; asm volatile( "mov.b64 { %0, %1 }, %2;"…
Rainn
  • 315
  • 1
  • 9
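
The usual pattern behind such an overload, sketched here with the modern *_sync variants: split the double into two 32-bit halves, shuffle each half, and reassemble. Giving the wrapper a name other than __shfl_xor sidesteps the visibility problem the question describes.

// Shuffle a 64-bit double as two 32-bit halves.
__device__ double shfl_xor_double(double v, int laneMask,
                                  unsigned mask = 0xffffffffu) {
    int hi = __double2hiint(v);
    int lo = __double2loint(v);
    hi = __shfl_xor_sync(mask, hi, laneMask);
    lo = __shfl_xor_sync(mask, lo, laneMask);
    return __hiloint2double(hi, lo);
}
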
1 vote • 1 answer

how to find the active SMs?

Is there any way by which I can know the number of free/active SMs? Or at least to read the voltage/power or temperature values of each SM, by which I can know whether it is working or not? (in real time, while some job is getting executed on the GPU…
Rakesh Kumar
  • 51
  • 1
  • 7
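
PTX offers no view of which SMs are idle, but a kernel can at least report which SM it is running on via the %smid special register; a minimal sketch:

// Read the ID of the SM executing the calling thread.
__device__ unsigned current_sm_id() {
    unsigned sm;
    asm("mov.u32 %0, %%smid;" : "=r"(sm));
    return sm;
}
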
1 vote • 1 answer

Missing CUDA inline PTX constraint letter for 8 bit variables in order to disable L1 cache for 8 bit variable (bool)

INTRODUCTION: In this question we can learn how to disable the L1 cache for a single variable. Here is the accepted answer: As mentioned above you can use inline PTX; here is an example: __device__ __inline__ double ld_gbl_cg(const double *addr) { …
Sam
  • 27
  • 6
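
There is indeed no 8-bit constraint letter in CUDA's inline PTX. A sketch of the usual workaround, following the pattern the question cites: load the byte through a 16-bit register with the "h" constraint and let ld.global.cg.u8 zero-extend into it.

// Cache-global (cg) load of a single byte, returned as bool.
__device__ __inline__ bool ld_gbl_cg_u8(const bool *addr) {
    unsigned short t;
    asm("ld.global.cg.u8 %0, [%1];" : "=h"(t) : "l"(addr));
    return t != 0;
}
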
1 vote • 1 answer

CUDA __float_as_int in acosf implementation

CUDA C's math function implementation (cuda/math_functions.h) of acosf contains the passage: if (__float_as_int(a) < 0) { t1 = CUDART_PI_F - t1; } where a and t1 are floats and CUDART_PI_F is a float previously set to a numerical value close to…
ritter
  • 7,447
  • 7
  • 51
  • 84
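
As for what the test means: __float_as_int() reinterprets the float's bit pattern as a signed 32-bit integer, so the comparison is true exactly when the sign bit is set; a one-line sketch:

// True for negative values, including -0.0f (and NaNs with the sign bit set).
__device__ bool sign_bit_set(float a) {
    return __float_as_int(a) < 0;
}
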
1 vote • 1 answer

Syntax on inline PTX code for CUDA

As written in Nvidia's Inline PTX Assembly document, the grammar for using inline assembly is: asm("temp_string" : "constraint"(output) : "constraint"(input)); Here are two examples: asm("vadd.s32.s32.s32 %0, %1.h0, %2.h0;" : "=r"(v) : "r"(a),…
gpunerd
  • 151
  • 9
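
A complete minimal sketch of that grammar, with one output and two inputs: "=r" marks a 32-bit register that is written, "r" a 32-bit register that is read, and %0, %1, %2 number the operands with outputs first.

// 32-bit integer addition written as inline PTX.
__device__ int add_ptx(int a, int b) {
    int c;
    asm("add.s32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));
    return c;
}
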
0 votes • 0 answers

Using NVCC-generated PTX file in OpenCL

I'm experimenting with using an nvcc-generated PTX file in an OpenCL-based host program. A simple testing kernel is written in both CUDA C and OpenCL C, shown below. // exampleKernel_cuda.cu extern "C" { __global__ void exampleKernel(float *input,…
MCx
  • 1
0 votes • 1 answer

Confusion about __cvta_generic_to_shared

NVIDIA Ampere GPUs support a new feature: asynchronous copy from global memory to shared memory, bypassing L1 and the register file. The corresponding PTX instruction is cp.async. Why is __cvta_generic_to_shared needed to convert a shared memory pointer (T*) to size_t? size_t smem_a_converted =…
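
The conversion exists because cp.async takes its destination as an address in the .shared state space rather than a generic pointer. A sketch of the common pattern (a single 16-byte L1-bypassing copy; the cp.async.commit_group/wait_group synchronization is omitted):

// Asynchronously copy 16 bytes from global to shared memory (Ampere and later).
__device__ void copy16_async(void *smem_dst, const void *gmem_src) {
    unsigned smem_addr =
        static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;"
                 :: "r"(smem_addr), "l"(gmem_src));
}
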
0 votes • 0 answers

How can I run LLVM custom passes on the LLVM IR file of a CUDA code using opt?

I am having problems running LLVM custom passes on the ".ll" file of a CUDA program. For example, I have a CUDA sample code named sample1.cu. I have compiled the CUDA code using the following command. ./bin/clang++ -flegacy-pass-manager -g -Xclang…
Tauro
  • 59
  • 7
0 votes • 0 answers

Optix LNK2001 unresolved external symbol

I'm trying to follow a step-by-step tutorial on OptiX at GitHub, but in one file there's this line: extern "C" char embedded_ptx_code[]; which leads to the error in the title, since there's no definition of that C string. I don't get it, am I…
mrmanet
  • 3
  • 2
0 votes • 1 answer

How to get instruction cost in NVIDIA GPU?

I want to know how many clocks instructions such as add, mul, and ld/st cost on an NVIDIA GPU. How can I do that? I wrote some code to test it and ran it on a 2080 Ti: const int test_cnt = 1000; auto lastT = clock64(); uint32_t…
sorfkc
  • 13
  • 4
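
A sketch of the usual clock64() microbenchmarking pattern: time a long dependent chain so that instruction latency dominates, and keep the result observable so the compiler cannot delete the loop (kernel and parameter names are made up).

// Time a chain of dependent integer operations with clock64().
__global__ void time_add(long long *cycles, int *sink, int iters) {
    int x = threadIdx.x;
    long long t0 = clock64();
    for (int i = 0; i < iters; ++i)
        x += i ^ x;                   // each iteration depends on the last
    long long t1 = clock64();
    if (threadIdx.x == 0) {
        *sink = x;                    // keep the computation live
        *cycles = t1 - t0;
    }
}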