Defining MACRO depending on GPU compute capability

Question

I have compiled my CUDA/C++ project for all GPU compute capabilities by generating PTX for each of them (1.x, 2.x, 3.x, 5.0). The problem is that my kernel's efficiency for a given CC depends on the value X of MACRO (defined at compile time). So, is there a way to associate the value of X with a specific CC? I have tried using __CUDA_ARCH__ as follows, but the compiler says identifier MACRO is undefined.

Thank you.

#ifdef __CUDA_ARCH__
    #if (__CUDA_ARCH__ >= 500)
        #define MACRO 10
    #elseif (__CUDA_ARCH__ < 500)
        #define MACRO 32
    #endif
#endif

__global__ kernel ()
{
    // some device code using MACRO
}

int main()
{
    // some host code using MACRO
    kernel <<< >>> ();

    return 0; 
}

What about if `__CUDA_ARCH__` is between 301 and 499? You should always define conditionally defined macros to some default first. Or what if it is not defined at all? Add some kind of trap for this. — Vality, Aug 14 '14 at 09:24
Where did you use `__CUDA_ARCH__`? You can only use it in `.cu` files that will be compiled for the GPU. Just have a look at these two SO questions - [1](http://stackoverflow.com/questions/9056183/compiling-for-compute-capability-2-x-in-cuda-c-for-vs2010/9100961#9100961) and [2](http://stackoverflow.com/questions/21864602/different-kernels-for-different-architectures/21865111#21865111) — hubs, Aug 14 '14 at 09:28
Possible problems are the ones listed above: you're missing CCs like 4.2 and/or you're using this stuff in a non-NVCC compiled file — Marco A., Aug 14 '14 at 09:30
`// some host code using MACRO` This part won't work. The `__CUDA_ARCH__` macro is [only defined](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-identification-macro) when the device code compiler is compiling the code, which is not true for your host code. Since this part presumably doesn't affect the "efficiency" of your device code, you can handle this (host portion) by querying device properties at runtime, and conditioning your code accordingly. Using `MACRO` should be fine in device code, or else provide a complete example that shows the error. — Robert Crovella, Aug 14 '14 at 13:35

score 4 · Accepted Answer · answered Aug 14 '14 at 13:35

When GPU programs are compiled there are two passes: the host pass and the device pass.

The host pass compiles all the host code once, and in this pass __CUDA_ARCH__ is not defined.

The device pass compiles all the device code (__global__ and __device__) once for each targeted compute capability. In this pass __CUDA_ARCH__ is defined.

Your problem is that in your host code (int main()) you attempt to use MACRO. Since __CUDA_ARCH__ is not defined in the host pass, MACRO is also undefined there. You cannot use anything that depends upon __CUDA_ARCH__ in host code. Instead, for host code you must use cudaGetDeviceProperties to determine your current compute capability at runtime.
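The host/device split described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the host-pass default of 32 and the helper `macro_for_cc` are assumptions introduced here, and the values 10 and 32 are simply the ones from the question.

```cpp
// Intended to compile both in a .cu file under nvcc and as plain C++.
#if defined(__CUDA_ARCH__)
    // Device pass: __CUDA_ARCH__ holds the target architecture (500 for CC 5.0).
    #if __CUDA_ARCH__ >= 500
        #define MACRO 10
    #else
        #define MACRO 32
    #endif
#else
    // Host pass: __CUDA_ARCH__ is undefined. This default is an assumption of
    // the sketch; it only keeps host code that mentions MACRO compiling.
    #define MACRO 32
#endif

// Hypothetical host-side helper that applies the same rule at runtime.
// major and minor are meant to come from a cudaDeviceProp structure.
inline int macro_for_cc(int major, int minor) {
    return (major * 100 + minor * 10 >= 500) ? 10 : 32;
}
```

On the host, one would fill a `cudaDeviceProp` with `cudaGetDeviceProperties(&prop, device)` and call `macro_for_cc(prop.major, prop.minor)` instead of reading MACRO, because the host-pass value of MACRO says nothing about the GPU that is actually present.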

Marco A. · Answer 2 · 2014-08-14T14:47:35.360

The problem has nothing to do with CUDA but rather with the #elseif preprocessor directive, which does not exist and therefore isn't recognized, and thus the following

#ifdef __CUDA_ARCH__
    #if (__CUDA_ARCH__ >= 500)
        #define MACRO 10
    #elseif (__CUDA_ARCH__ < 500)
        #define MACRO 32
    #endif
#endif

is simply equivalent to

#ifdef __CUDA_ARCH__
    #if (__CUDA_ARCH__ >= 500)
        #define MACRO 10
    #helloworld (__CUDA_ARCH__ < 500)
        #define MACRO 32
    #endif
#endif

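and, since macros are just textual substitution, it triggers an "undefined MACRO" error wherever MACRO is used when the first condition wasn't met: the preprocessor skips the unrecognized line together with the second #define, so MACRO is never defined.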

Solution: use #elif

#ifdef __CUDA_ARCH__
    #if (__CUDA_ARCH__ >= 500)
        #define MACRO 10
    #elif (__CUDA_ARCH__ < 500)
        #define MACRO 32
    #endif
#endif

Edit: NVCC is just a compilation driver; it steers the compilation through the device and host passes. Your define won't be visible in the host pass, since __CUDA_ARCH__ isn't defined there at all. Generally you shouldn't deal with it at compile time from host code. Consider querying the capabilities at runtime (one possible solution, off the top of my head, is to compile specialized copies of the kernel for multiple architectures, if the per-architecture optimizations offered by the PTX intermediate representation don't suit you and you need to do something more "consistent").
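One way to read the "specialized copies" idea is a template parameterized on the tuning value, with the copy chosen at runtime from the device's compute capability. The sketch below uses ordinary host functions as stand-ins so that it stays plain C++; `tuned_work` and `run_for_cc` are hypothetical names, and in real CUDA code the template would be a `__global__` kernel launched with `<<< >>>`.

```cpp
// Stand-in for a kernel specialized on a compile-time tuning value X.
// Hypothetical work: the number of X-sized tiles needed to cover n elements.
template <int X>
int tuned_work(int n) {
    return (n + X - 1) / X;
}

// Runtime dispatch: pick the specialization from the compute capability
// major version (assumed to be read from cudaDeviceProp::major).
inline int run_for_cc(int major, int n) {
    if (major >= 5) {
        return tuned_work<10>(n);  // value the question uses for CC >= 5.0
    }
    return tuned_work<32>(n);      // value the question uses for older CCs
}
```

Both specializations are compiled into the binary, so the choice no longer depends on __CUDA_ARCH__ being visible to host code.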

Same problem. Btw, I'm using Nvidia Nsight EE (cc3.0 device). Take a sneak peek at the following [link](http://postimg.org/image/oj0mq2ipz/) — Djeb, Aug 14 '14 at 12:44
I know I'm supposed to have the `#define MACRO` following the `elif` colored as the kernel is... unfortunately it isn't... — Djeb, Aug 14 '14 at 12:46
I can't remember, but I suppose `__CUDA_ARCH__` isn't defined in host code, so you shouldn't have any highlight. Try to compile and see what happens. Also notice that if it isn't defined, your `MACRO` won't be defined either. You should do something for the host path. — Marco A., Aug 14 '14 at 13:15
