
I'm trying to write a CUDA application that is templated for floats and doubles, so that I can run on both single- and double-precision cards. The application uses dynamically allocated global memory, dynamically allocated shared memory, constant memory, and static global memory.

I've seen examples for templating dynamically allocated global and shared memory variables. And I realize that constant memory is static and so templating is generally not possible, as stated in this post: Defining templated constant variables in cuda.
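For reference, the shared-memory templating I've seen uses a specialization trick along these lines (a sketch with my own names, since a bare templated `extern __shared__` array would collide across instantiations):

```cuda
// Proxy struct specialized per type so each instantiation gets a
// distinctly named extern __shared__ array.
template <typename T> struct SharedMemory;   // no generic definition

template <> struct SharedMemory<float> {
  __device__ float *get() { extern __shared__ float s_f[]; return s_f; }
};
template <> struct SharedMemory<double> {
  __device__ double *get() { extern __shared__ double s_d[]; return s_d; }
};

template <typename T>
__global__ void scale(T *out, T factor) {
  // dynamically sized shared buffer, typed via the proxy
  T *s = SharedMemory<T>().get();
  s[threadIdx.x] = out[threadIdx.x] * factor;
  __syncthreads();
  out[threadIdx.x] = s[threadIdx.x];
}
```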

I've been unable to find any workarounds to this constant memory issue, which surprises me because I'm sure I'm not the first to encounter this problem. At the moment it seems I am faced with having to write two copies of the same application, one for doubles and one for floats, if I want to use constant memory. I'm hoping this isn't the case.

As a workaround, I'm considering writing a (virtual?) base class that is templated and implements everything except the constant memory variable declarations. I'd then write two classes that inherit from the base (one for floats, one for doubles) that mainly just handle the constant variable declarations. My question is whether this strategy will work, or whether there is an obvious flaw; I thought I'd ask before implementing the design only to find it doesn't work. If this strategy does not work, are there any proven strategies that at least alleviate the problem, or will I simply have to write two copies of the application, one for float and one for double?
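To make the idea concrete, here is roughly what I have in mind (all names hypothetical; note that `__constant__` variables must live at file scope rather than as class members, so the derived classes can only bind to per-type constant symbols):

```cuda
#define N 64

// __constant__ symbols must be declared at file scope, one per type
__constant__ float  c_params_f[N];
__constant__ double c_params_d[N];

// Templated base class implements everything type-generic.
template <typename T>
class SolverBase {
public:
  virtual ~SolverBase() {}
  // derived classes bind the matching __constant__ symbol
  virtual void uploadParams(const T *host) = 0;
  // ... templated global/shared memory management, launch wrappers ...
};

class FloatSolver : public SolverBase<float> {
public:
  void uploadParams(const float *host) {
    cudaMemcpyToSymbol(c_params_f, host, N * sizeof(float));
  }
};

class DoubleSolver : public SolverBase<double> {
public:
  void uploadParams(const double *host) {
    cudaMemcpyToSymbol(c_params_d, host, N * sizeof(double));
  }
};
```

One caveat I can already see: any kernel that reads the constant data must still name the correct symbol (`c_params_f` vs. `c_params_d`), so some per-type code remains on the device side no matter what the host-side class hierarchy looks like.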

Michael Puglia

1 Answer


Note, this answer is only of historical interest, or for users who are using a CUDA toolkit version 6.5 or older. Since CUDA 7.0, there are no supported CUDA devices that only support float, so the nvcc CUDA compiler no longer retains the ability described below to automatically demote double to float.

Since you mention that you're concerned only about float and double, and that you only need float on devices that don't support double, it seems like you could take advantage of the nvcc compiler's automatic demotion of double to float to handle this.

Here's an example using __constant__ memory:

$ cat t264.cu
#include <stdio.h>

#define DSIZE 64

__constant__ double my_const_data[DSIZE];

__global__ void my_kernel(double *data){
  data[1] = my_const_data[0];
  data[0] = sqrt(my_const_data[0]);
}

int main(){
  double my_data[DSIZE], h_data[DSIZE], *d_data;
  my_data[0] = 256.0;
  cudaMemcpyToSymbol(my_const_data, my_data, sizeof(double)*DSIZE);
  printf("hello\n");
  cudaMalloc((void **)&d_data, sizeof(double)*DSIZE);
  my_kernel<<<1,1>>>(d_data);
  cudaMemcpy(h_data, d_data, sizeof(double)*DSIZE, cudaMemcpyDeviceToHost);
  printf("data = %lf\n", h_data[1]);
  printf("sqrt = %lf\n", h_data[0]);
  return 0;
}

$ nvcc -o t264 t264.cu
ptxas /tmp/tmpxft_00003228_00000000-5_t264.ptx, line 62; warning : Double is not supported. Demoting to float
$ ./t264
hello
data = 256.000000
sqrt = 16.000000
$
Robert Crovella
    I'd closely examine the generated microcode for the float case before putting this approach into production. – ArchaeaSoftware Oct 22 '13 at 01:36
  • I guess there is a good work-around that I was missing. It seems like this will suit my needs, so accepting the answer. Thanks for the pointer. – Michael Puglia Oct 22 '13 at 03:09