NVCC compilation options for generating the best code (using JIT)

Question

I am trying to understand the nvcc compilation phases, but I am a little confused. Because I don't know the exact hardware configuration of the machine that will run my software, I want to use the JIT compilation feature to generate the best possible code for it. In the NVCC documentation I found this:

"For instance, the command below allows generation of exactly matching GPU binary code, when the application is launched on an sm_10, an sm_13, and even a later architecture:"

nvcc x.cu -arch=compute_10 -code=compute_10

So my understanding is that the above options will produce the best/fastest/optimum code for the current GPU. Is that correct? I also read that the default nvcc options are:

nvcc x.cu -arch=compute_10 -code=sm_10,compute_10

If the above is indeed correct, why can't I use any compute_20 features in my application?

score 4 · Accepted Answer · answered May 30 '14 at 08:44

When you specify a target architecture, you restrict yourself to the features available in that architecture. That's because PTX is a virtual assembly language, so the feature set has to be fixed when the PTX is generated. The PTX will be JIT-compiled to GPU binary code (SASS) for whatever GPU you are running on, but it can't use features from architectures newer than the one the PTX was generated for.

I suggest that you pick a minimum architecture (for example, 1.3 if you want double precision or 2.0 if you want a Fermi-or-later feature) and then create PTX for that architecture AND newer base architectures. You can do this in one command (although it will take longer since it requires multiple passes through the code) and bundle everything into a single fat binary.

An example command line may be:

nvcc <general options> <filename.cu> \
    -gencode arch=compute_13,code=compute_13 \
    -gencode arch=compute_20,code=compute_20 \
    -gencode arch=compute_30,code=compute_30 \
    -gencode arch=compute_35,code=compute_35

That will create four PTX versions in the binary. You could also compile for selected GPUs at the same time, which spares your users the JIT compile time but increases the size of your binary.
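As a rough sketch of that mixed approach, the script below assembles a command line that embeds SASS for a few specific GPUs plus PTX for the newest one as a JIT fallback. The architecture list, the variable names, and the file name x.cu are assumptions for illustration only, and recent CUDA toolkits no longer accept these older architectures, so substitute the ones your toolkit supports. The script only prints the command and does not invoke nvcc.

```shell
#!/bin/sh
# Assumption: illustrative list of compute capabilities to build SASS for.
ARCHS="20 30 35"

FLAGS=""
for a in $ARCHS; do
    # code=sm_XX embeds a GPU binary (SASS), so no JIT is needed on that GPU.
    FLAGS="$FLAGS -gencode arch=compute_$a,code=sm_$a"
done

# Take the last entry of the list and also embed its PTX (code=compute_XX),
# so that GPUs newer than anything listed can still JIT-compile the kernels.
NEWEST=${ARCHS##* }
FLAGS="$FLAGS -gencode arch=compute_$NEWEST,code=compute_$NEWEST"

# Print the command instead of running it; x.cu is a placeholder file name.
CMD="nvcc x.cu$FLAGS"
echo "$CMD"
```

The resulting fat binary runs without JIT on the listed GPUs and falls back to the embedded PTX everywhere else.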

Check out the NVCC manual for more information on this.

Which PTX version is used when you run the fat binary executable? — user2023370, Oct 24 '19 at 11:01
It depends on which GPU architecture you are running on. It will pick the latest PTX that is compatible. This answer is 5 years old; if you were to use that command on a Volta or Turing generation GPU, it would take the compute_35 PTX and JIT-compile it for the current GPU. — Tom, Oct 24 '19 at 11:04
No, it doesn't need to. Once it has JIT-compiled for the current GPU, it keeps the result in a cache and reuses the compiled version when you launch the kernel again. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars for info on where the cache is stored (CUDA_CACHE_PATH variable). — Tom, Oct 24 '19 at 11:09
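
A minimal sketch of pointing the JIT cache at a directory of your choice, using the CUDA_CACHE_PATH and CUDA_CACHE_DISABLE environment variables described on the page linked above. The directory name jit_cache and the commented-out binary name are assumptions for illustration.

```shell
#!/bin/sh
# Assumption: an arbitrary cache directory under the current working directory.
export CUDA_CACHE_PATH="$PWD/jit_cache"

# 0 keeps JIT caching enabled; setting it to 1 would disable the cache.
export CUDA_CACHE_DISABLE=0

# Create the directory up front so its contents are easy to inspect later.
mkdir -p "$CUDA_CACHE_PATH"

# Then launch the application from this shell (hypothetical binary name):
# ./a.out
```

Setting CUDA_FORCE_PTX_JIT=1 for a run makes the driver ignore any embedded GPU binaries and JIT-compile from the PTX, which is a handy way to check that the PTX fallback really works.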
