
I have a kernel which achieves its highest performance at different block sizes on Kepler and Fermi hardware. I would like, at compile time, to check the current architecture target and define a THREADS_PER_BLOCK macro that is used to i) launch the kernel; ii) determine the number of blocks necessary; iii) statically set the shared memory size in the kernel.

The code below demonstrates what I am attempting to do. Suppose I am targeting GK104 hardware, and hence use nvcc -arch=sm_30. This will still result in THREADS_PER_BLOCK = 256, since __CUDA_ARCH__ is not defined for the host code compilation. (I understand, from e.g. this answer, why it can't work this way.)

#if __CUDA_ARCH__ >= 300
#define THREADS_PER_BLOCK 512
#else
#define THREADS_PER_BLOCK 256
#endif

__global__ void some_kernel(int* a, int* b) {
    __shared__ int sm_data[THREADS_PER_BLOCK];
    // Do something.
}

int main(void) {
    // Initialize data.
    // Calculate blocks based on THREADS_PER_BLOCK, problem size and some max.
    some_kernel<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b);
    return 0;
}

I could check device properties at run-time and use dynamic shared memory, but would like to know if this can be hard-coded at compile time without e.g. having to manually add a -DFERMI or -DKEPLER flag and set THREADS_PER_BLOCK based on that. NB: Any users of this code will be compiling it themselves, almost certainly for one architecture, so this isn't an unreasonable option. It just seems superfluous in light of passing the -arch= flag.
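For completeness, the run-time fallback I mention would look roughly like the sketch below. It is illustrative only: threadsPerBlock, smemBytes and n are placeholder names introduced here, and the fixed value of threadsPerBlock stands in for a choice made from the device properties at run-time.

#include <cuda_runtime.h>

__global__ void some_kernel(int* a, int* b) {
    // Sized by the third launch-configuration argument instead of a macro.
    extern __shared__ int sm_data[];
    // Do something.
}

int main(void) {
    int n = 1 << 20;                      // example problem size
    int *d_a = NULL, *d_b = NULL;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));

    int threadsPerBlock = 256;            // would be chosen from the device properties at run-time
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    size_t smemBytes = threadsPerBlock * sizeof(int);

    some_kernel<<<blocks, threadsPerBlock, smemBytes>>>(d_a, d_b);
    cudaDeviceSynchronize();

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}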

Sam
  • I see two possibilities (both of which I have used before): (1) Use dynamically allocated shared memory and configure the kernel at runtime based on GPU architecture. (2) Templatize the kernel, instantiate as many different versions as desired at compile time, then select the appropriate instance at run time based on GPU architecture. (A sketch of the second option follows these comments.) – njuffa Nov 07 '13 at 22:59
  • Here you can find a solution for this problem: http://nvlabs.github.io/moderngpu/performance.html#launchbox – vinograd47 Nov 08 '13 at 07:47
  • @njuffa Yeah, dynamic shared memory was my back-up option. I'm assuming there is some overhead associated with this? (I can always test.) – Sam Nov 08 '13 at 13:48
  • @jet47 Looks like it could do the job with minimal effort on my part, cheers. – Sam Nov 08 '13 at 13:54
  • @Sam: To my knowledge, dynamically sizing the shared memory as part of a kernel launch does not increase the kernel launch time. If you have multiple arrays in shared memory, there will be a few multiplies and adds to determine the starting pointer for each array, but that is a small one-time cost. – njuffa Nov 08 '13 at 16:34
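To make njuffa's second suggestion concrete, here is a minimal sketch (not from the original post): the block size becomes a template parameter, both instantiations are compiled up front, and the host selects one at run time from the device's compute capability. The launch_some_kernel wrapper and the 512/256 split are assumptions for illustration.

#include <cuda_runtime.h>

template <int BLOCK_SIZE>
__global__ void some_kernel(int* a, int* b) {
    // Shared memory is still sized statically, once per instantiation.
    __shared__ int sm_data[BLOCK_SIZE];
    // Do something.
}

// Hypothetical wrapper: picks an instantiation based on the detected device.
void launch_some_kernel(int* d_a, int* d_b, int n) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.major >= 3) {                              // Kepler (sm_30) and newer
        some_kernel<512><<<(n + 511) / 512, 512>>>(d_a, d_b);
    } else {                                            // Fermi
        some_kernel<256><<<(n + 255) / 256, 256>>>(d_a, d_b);
    }
}

int main(void) {
    int n = 1 << 20;
    int *d_a = NULL, *d_b = NULL;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    launch_some_kernel(d_a, d_b, n);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}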

1 Answer


The nvcc compiler does not detect locally available GPUs; by default it always targets SM 1.0. Otherwise it could introduce quite confusing behavior when building on different systems.

To compile for the available device, you either need to ask the user to specify the SM version or run some detection code at build time. I'm convinced that it is easier to put the hardware detection into the runtime and then configure your kernel launch as desired.
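For example, a minimal sketch of the run-time detection, assuming a single device (device 0) and the 512/256 block sizes from the question:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // query the device actually in use

    // Pick the launch configuration from the compute capability.
    int threadsPerBlock = (prop.major >= 3) ? 512 : 256;   // Kepler : Fermi

    printf("Detected sm_%d%d, using %d threads per block\n",
           prop.major, prop.minor, threadsPerBlock);

    // The number of blocks and the dynamic shared memory size would then be
    // derived from threadsPerBlock, exactly as in the question.
    return 0;
}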

Eugene
  • From the poster's question, it seems that he would like to avoid doing it at runtime... – Vitality Nov 07 '13 at 22:45
  • @JackOLantern Thanks. I redid the answer. – Eugene Nov 07 '13 at 22:53
  • Sorry, I didn't mention it in my question, but I am already compiling with e.g. `-arch=sm_30`, so specifying the target at compile-time. I would just like that information to affect the host code/kernel launch, if that is possible. Updated question to clarify this. – Sam Nov 08 '13 at 13:55
  • These NVCC flags do not have any bearing on the host code. Host code is compiled once regardless of the number of SM versions specified, unlike device code. Thus NVCC cannot set predefined macro values for the host code. – Eugene Nov 08 '13 at 16:46