CUDA demoting double to float error despite no doubles in code

Question

I'm writing a kernel using PyCUDA. My GPU only supports compute capability 1.1 (arch sm_11), so I can only use floats in my code. I've taken great care to ensure that everything is done with floats, but one particular line in my code keeps causing a compiler error.

The chunk of code is:

  // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height. 
  if( j > 0 && j < im_width && i > 0 && i < im_height){
    gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
  }

Here, idx() is a __device__ helper function that returns a linear index from the pixel indices i and j, and it only works with integers. I use it throughout and it doesn't cause errors anywhere else, so I strongly suspect idx() is not the problem. The sqrt() call is just the one from the standard C math functions, which supports floats. The arrays involved, x_gradient, y_gradient, and gradient_mag, are all float* and are part of the input to my function (i.e. declared in Python, then converted to device variables, etc.).

I've tried removing the extra cast to float in my code above, with no luck. I've also tried doing something completely stupid like this:

 // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height. 
 if( j > 0 && j < im_width && i > 0 && i < im_height){
    gradient_mag[idx(i,j)] = 3.0f; // also tried float(3.0) here
  }

All of these variations give the same error:

 pycuda.driver.CompileError: nvcc said it demoted types in source code it compiled--this is likely not what you want.
 [command: nvcc --cubin -arch sm_11 -I/usr/local/lib/python2.7/dist-packages/pycuda-2011.1.2-py2.7-linux-x86_64.egg/pycuda/../include/pycuda kernel.cu]
 [stderr:
 ptxas /tmp/tmpxft_00004329_00000000-2_kernel.ptx, line 128; warning : Double is not supported. Demoting to float
 ]

Any ideas? I've debugged many errors in my code and was hoping to get it working tonight, but this has proved to be a bug that I cannot understand.

Added -- Here is a truncated version of the kernel that produces the same error above on my machine.

 every_pixel_hog_kernel_source = \
 """
 #include <math.h>
 #include <stdio.h>

 __device__ int idx(int ii, int jj){
     return gridDim.x*blockDim.x*ii+jj;
 }

 __device__ int bin_number(float angle_val, int total_angles, int num_bins){ 

     float angle1;   
     float min_dist;
     float this_dist;
     int bin_indx;

     angle1 = 0.0;
     min_dist = abs(angle_val - angle1);
     bin_indx = 0;

     for(int kk=1; kk < num_bins; kk++){
         angle1 = angle1 + float(total_angles)/float(num_bins);
         this_dist = abs(angle_val - angle1);
         if(this_dist < min_dist){
             min_dist = this_dist;
             bin_indx = kk;
         }
     }

     return bin_indx;
 }

 __device__ int hist_number(int ii, int jj){

     int hist_num = 0;

     if(jj >= 0 && jj < 11){ 
         if(ii >= 0 && ii < 11){ 
             hist_num = 0;
         }
         else if(ii >= 11 && ii < 22){
             hist_num = 3;
         }
         else if(ii >= 22 && ii < 33){
             hist_num = 6;
         }
     }
     else if(jj >= 11 && jj < 22){
         if(ii >= 0 && ii < 11){ 
             hist_num = 1;
         }
         else if(ii >= 11 && ii < 22){
             hist_num = 4;
         }
         else if(ii >= 22 && ii < 33){
             hist_num = 7;
         }
     }
     else if(jj >= 22 && jj < 33){
         if(ii >= 0 && ii < 11){ 
             hist_num = 2;
         }
         else if(ii >= 11 && ii < 22){
             hist_num = 5;
         }
         else if(ii >= 22 && ii < 33){
             hist_num = 8;
         }
     }

     return hist_num;
 }

  __global__ void every_pixel_hog_kernel(float* input_image, int im_width, int im_height, float* gaussian_array, float* x_gradient, float* y_gradient, float* gradient_mag, float* angles, float* output_array)
  {    
      /////
      // Setup the thread indices and linear offset.
      /////
      int i = blockDim.y * blockIdx.y + threadIdx.y;
      int j = blockDim.x * blockIdx.x + threadIdx.x;
      int ang_limit = 180;
      int ang_bins = 9;
      float pi_val = 3.141592653589f; //91

      /////
      // Compute a Gaussian smoothing of the current pixel and save it into a new image array
      // Use sync threads to make sure everyone does the Gaussian smoothing before moving on.
      /////
      if( j > 1 && i > 1 && j < im_width-2 && i < im_height-2 ){

            // Hard-coded unit standard deviation 5-by-5 Gaussian smoothing filter.
            gaussian_array[idx(i,j)] = float(1.0/273.0) *(
            input_image[idx(i-2,j-2)] + float(4.0)*input_image[idx(i-2,j-1)] + float(7.0)*input_image[idx(i-2,j)] + float(4.0)*input_image[idx(i-2,j+1)] + input_image[idx(i-2,j+2)] + 
            float(4.0)*input_image[idx(i-1,j-2)] + float(16.0)*input_image[idx(i-1,j-1)] + float(26.0)*input_image[idx(i-1,j)] + float(16.0)*input_image[idx(i-1,j+1)] + float(4.0)*input_image[idx(i-1,j+2)] +
            float(7.0)*input_image[idx(i,j-2)] + float(26.0)*input_image[idx(i,j-1)] + float(41.0)*input_image[idx(i,j)] + float(26.0)*input_image[idx(i,j+1)] + float(7.0)*input_image[idx(i,j+2)] +
            float(4.0)*input_image[idx(i+1,j-2)] + float(16.0)*input_image[idx(i+1,j-1)] + float(26.0)*input_image[idx(i+1,j)] + float(16.0)*input_image[idx(i+1,j+1)] + float(4.0)*input_image[idx(i+1,j+2)] +
            input_image[idx(i+2,j-2)] + float(4.0)*input_image[idx(i+2,j-1)] + float(7.0)*input_image[idx(i+2,j)] + float(4.0)*input_image[idx(i+2,j+1)] + input_image[idx(i+2,j+2)]);
     }
     __syncthreads();

     /////
     // Compute the simple x and y gradients of the image and store these into new images
     // again using syncthreads before moving on.
     /////

     // X-gradient, ensure x is between 1 and width-1
     if( j > 0 && j < im_width){
         x_gradient[idx(i,j)] = float(input_image[idx(i,j)] - input_image[idx(i,j-1)]);
     }
     else if(j == 0){
         x_gradient[idx(i,j)] = float(0.0);
     }

    // Y-gradient, ensure y is between 1 and height-1
    if( i > 0 && i < im_height){
         y_gradient[idx(i,j)] = float(input_image[idx(i,j)] - input_image[idx(i-1,j)]);
    }
    else if(i == 0){
        y_gradient[idx(i,j)] = float(0.0);
    }
    __syncthreads();

    // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height. 
    if( j < im_width && i < im_height){

        gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
    }
    __syncthreads();

    /////
    // Compute the orientation angles
    /////
    if( j < im_width && i < im_height){
        if(ang_limit == 360){
            angles[idx(i,j)] = float((atan2(y_gradient[idx(i,j)],x_gradient[idx(i,j)])+pi_val)*float(180.0)/pi_val);
        }
        else{
            angles[idx(i,j)] = float((atan( y_gradient[idx(i,j)]/x_gradient[idx(i,j)] )+(pi_val/float(2.0)))*float(180.0)/pi_val);
        }
    }
    __syncthreads();

    // Compute the HoG using the above arrays. Do so in a 3x3 grid, with 9 angle bins for each grid.
    // forming an 81-vector and then write this 81 vector as a row in the large output array.

    int top_bound, bot_bound, left_bound, right_bound, offset;
    int window = 32;

    if(i-window/2 > 0){
        top_bound = i-window/2;
        bot_bound = top_bound + window;
    }
    else{
        top_bound = 0;
        bot_bound = top_bound + window;
    }

    if(j-window/2 > 0){
        left_bound = j-window/2;
        right_bound = left_bound + window;
    }
    else{
        left_bound = 0;
        right_bound = left_bound + window;
    }

    if(bot_bound - im_height > 0){
        offset = bot_bound - im_height;
        top_bound = top_bound - offset;
        bot_bound = bot_bound - offset;
    }

    if(right_bound - im_width > 0){
        offset = right_bound - im_width;
        right_bound = right_bound - offset;
        left_bound = left_bound - offset;
    }

    int counter_i = 0;
    int counter_j = 0;
    int bin_indx, hist_indx, glob_col_indx, glob_row_indx;
    int row_width = 81; 

    for(int pix_i = top_bound; pix_i < bot_bound; pix_i++){
        for(int pix_j = left_bound; pix_j < right_bound; pix_j++){

            bin_indx = bin_number(angles[idx(pix_i,pix_j)], ang_limit, ang_bins);
            hist_indx = hist_number(counter_i,counter_j);

            glob_col_indx = ang_bins*hist_indx + bin_indx;
            glob_row_indx = idx(i,j);

            output_array[glob_row_indx*row_width + glob_col_indx] = float(output_array[glob_row_indx*row_width + glob_col_indx] + float(gradient_mag[idx(pix_i,pix_j)]));


            counter_j = counter_j + 1; 
        }
        counter_i = counter_i + 1;
        counter_j = 0;
    }

}
"""

Try `sqrtf()` maybe or `std::sqrt()`. What has Python to do with this? — Kerrek SB, Nov 29 '11 at 04:17
Thanks for the suggestion, but I have just tried it and `sqrtf()` did not help. I doubt this is PyCUDA specific, but I thought it was relevant to include that detail in case it is related to the way device variables are used in PyCUDA. – ely, Nov 29 '11 at 04:20
I will side with @KerrekSB here: the line number it's reporting is for the .ptx file, so you might be looking in the wrong place. – Mead, Nov 29 '11 at 05:58
Post the complete kernel code, your assumption about where the error is coming from is incorrect. — talonmies, Nov 29 '11 at 06:18
It's got to be that line. I removed the semicolon from the end, and it gives me an error saying that line 128 expected a semicolon. I checked all of that before posting here. The whole kernel is about 800 lines long, and I feel it's unnecessary to post the whole thing. — ely, Nov 29 '11 at 06:42
The reason for the suspicion is that the error mentions ptxas, so the intermediate PTX has already been generated. Removing a semicolon to find the line number gets caught at a different processing step, probably by cudafe or something similar in an earlier phase. However, an extra double literal would be caught at the ptxas step, so why not introduce extra doubles to help identify lines? – Mead, Nov 29 '11 at 07:23
@EMS: it isn't that line. Introducing a syntax error in the C code at the line proves nothing - the error you are asking about is being generated by the assembler, not the compiler. If you want help, post the complete kernel code, because at the moment you are looking in the wrong place. — talonmies, Nov 29 '11 at 08:55
I added a truncated version of my kernel to the OP above. This truncated version gives me the same "demoting double to float" error when I try to compile it. Let me know if you figure out that it's not the line that I thought I had traced it to. — ely, Nov 29 '11 at 17:46

Mead · Answer 1 · 2011-12-01T07:44:12.220

Here's an unmistakable case of using doubles:

 gaussian_array[idx(i,j)] = float(1.0/273.0) *

See the double literals being divided?

But really, use float literals instead of double literals cast to floats - the casts are ugly, and I suggest they will hide bugs like this.

-------Edit 1/Dec---------

Firstly, thanks @CygnusX1, constant folding would prevent that calculation - I didn't even think of it.

I've tried to reproduce the environment of the error. I installed CUDA SDK 3.2 (which @EMS mentioned seems to be the version used in the lab) and compiled the truncated kernel above. Indeed, nvopencc optimized the above calculation away (thanks @CygnusX1) and did not use doubles anywhere in the generated PTX code. Further, ptxas didn't give the error received by @EMS. From that, I thought the problem might lie outside the every_pixel_hog_kernel_source code itself, perhaps in PyCUDA. However, compiling with PyCUDA 2011.1.2 still does not produce a warning like the one in @EMS's question. I can get the error in the question, but only by introducing a double calculation, such as removing the cast from gaussian_array[idx(i,j)] = float(1.0/273.0) *

To match your Python setup: does the following produce your error?

import pycuda.driver as cuda
from pycuda.compiler import compile

x=compile("""put your truncated kernel code here""",options=[],arch="sm_11",keep=True)

It doesn't produce an error in my setup, so it is possible that I simply can't replicate your result. However, I can give some advice. When using compile (or SourceModule) with keep=True, Python prints the folder where the PTX file is generated just before showing the error message. If you examine the PTX file in that folder and look for where .f64 appears, it should give some idea of what is being treated as a double. Working out which code in your original kernel that corresponds to is difficult, though, so having the simplest example that produces your error will help you.
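The search for .f64 described above could be scripted roughly as follows. This is only a sketch: it assumes the PTX file kept by keep=True has already been read into a string, the helper name find_f64_lines is made up for illustration, and the sample PTX fragment is hypothetical (only its mov.f64 line comes from this thread).

```python
def find_f64_lines(ptx_text):
    """Return (line_number, text) pairs for PTX lines that use .f64."""
    hits = []
    for number, line in enumerate(ptx_text.splitlines(), start=1):
        # Ignore trailing // comments so a comment that mentions .f64
        # is not reported as a double operation.
        code = line.split("//", 1)[0]
        if ".f64" in code:
            hits.append((number, line.strip()))
    return hits


# Hypothetical PTX fragment, for illustration only.
SAMPLE_PTX = """\
    ld.global.f32   %f1, [%r1+0];
    mov.f64         %fd2, 0d3f6e01e01e01e01e;   // 0.003663
    cvt.f64.f32     %fd1, %f1;
    mul.f64         %fd3, %fd1, %fd2;
    cvt.rn.f32.f64  %f2, %fd3;
    st.global.f32   [%r2+0], %f2;
"""

if __name__ == "__main__":
    for number, text in find_f64_lines(SAMPLE_PTX):
        print(number, text)
```

Since the ptxas warning reports a line number in the .ptx file (line 128 in the question), comparing that number against this list should show which double operation is being demoted.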

After changing all of the double literals to float literals, I still get the exact same error. — ely, Nov 30 '11 at 07:46
Those are constants. Compiler is able to optimize that away! — CygnusX1, Nov 30 '11 at 18:08
Thank you very much for the `keep=True` comment. I still haven't resolved this, but that should be helpful. — ely, Dec 02 '11 at 03:07
If the double is a constant, it will hopefully appear in this form: `mov.f64 %fd2, 0d3f6e01e01e01e01e; // 0.003663` (storing the double value 0.003663 ready for a calculation). Part of the problem when finding the original equivalent is constant folding, as CygnusX1 pointed out. (This example was produced when I compiled 1.0/273.0 without the float cast: the generated PTX used the double result of that calculation, 0.003663, instead of actually doing the calculation.) The `.f64` operations are the ones working with doubles. – Mead, Dec 02 '11 at 04:13
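When the trailing comment is missing, a folded constant such as the one in the comment above can still be read back from its hex form. The sketch below assumes Python 3 and uses a made-up helper name, decode_ptx_double; it relies on PTX writing double constants as 0d followed by 16 hex digits of the IEEE 754 bit pattern.

```python
import struct


def decode_ptx_double(literal):
    """Decode a PTX double literal such as 0d3f6e01e01e01e01e into a float."""
    if not literal.lower().startswith("0d") or len(literal) != 18:
        raise ValueError("expected 0d followed by 16 hex digits")
    raw = bytes.fromhex(literal[2:])
    # The hex digits are the big-endian IEEE 754 double bit pattern.
    return struct.unpack(">d", raw)[0]


value = decode_ptx_double("0d3f6e01e01e01e01e")
print(value)
```

For the literal quoted in this thread, the result should be roughly 0.003663, that is 1.0/273.0, which ties the PTX constant back to the expression in the kernel source.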

score 1 · Answer 2 · answered Nov 29 '11 at 20:22

Your problem is here:

angle1 = 0.0;

0.0 is a double precision constant. 0.0f is a single precision constant.
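One rough way to list every unsuffixed literal of this kind in a kernel string is a regular-expression scan. This is a heuristic sketch, not a C parser: the name find_double_literals is hypothetical, comments and string literals are not stripped, and exponent-only forms such as 1e5 are not matched. It only lists candidates, and whether any of them is what ptxas complains about in this thread is not established.

```python
import re

# A decimal literal containing a point, with an optional exponent, that is
# not followed by a suffix character. An f or F suffix rejects the match.
DOUBLE_LITERAL = re.compile(
    r"(?<![\w.])(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?(?![\w.])"
)


def find_double_literals(source):
    """Return (line_number, literal) pairs for unsuffixed float literals."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in DOUBLE_LITERAL.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Lines adapted from the kernel in the question.
SAMPLE = """\
angle1 = 0.0;
float pi_val = 3.141592653589f;
gaussian_array[idx(i,j)] = float(1.0/273.0) * x;
gradient_mag[idx(i,j)] = 3.0f;
"""

if __name__ == "__main__":
    for number, literal in find_double_literals(SAMPLE):
        print(number, literal)
```

On the sample, the scan reports 0.0 on the first line and both 1.0 and 273.0 on the third, while the f-suffixed literals are skipped.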

talonmies


This is incorrect. Notice that I declared `angle1` as a float, so assigning it `0.0` automatically casts it as a float. Even if I add the f as you suggest, I get the same demoting double to float error with the same line number 128 detail. – ely Nov 29 '11 at 22:03
His point is correct though; you shouldn't have 0.0, it should be 0.0f. The same goes for everywhere; you shouldn't be using double literals, then casting them down; you should be using float literals. – Mead Nov 30 '11 at 03:51
Ah, and there might be your problem: float(1.0/273.0). You're doing double division, then converting to a float. – Mead Nov 30 '11 at 03:55

CygnusX1 · Answer 3 · 2011-11-30T06:26:27.437

(a comment, not an answer, but it is too big to put it as a comment)

Could you provide the PTX code around the line where the error occurs?

I tried compiling a simple kernel using the code you provided:

__constant__ int im_width;
__constant__ int im_height;

__device__ int idx(int i,int j) {
    return i+j*im_width;
}

__global__ void kernel(float* gradient_mag, float* x_gradient, float* y_gradient) {
    int i = threadIdx.x;
    int j = threadIdx.y;
  // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height.
  if( j > 0 && j < im_width && i > 0 && i < im_height){
    gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
  }
}

using:

nvcc.exe -m32 -maxrregcount=32 -gencode=arch=compute_11,code=\"sm_11,compute_11\" --compile -o "Debug\main.cu.obj" main.cu

I got no errors, using the CUDA 4.1 beta compiler.

Update

I tried compiling your new code (I am working in CUDA/C++, not PyCUDA, but this shouldn't matter) and couldn't reproduce the error there either, with both CUDA 4.1 and CUDA 4.0. Which version of CUDA do you have installed?

C:\>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Wed_Oct_19_23:13:02_PDT_2011
Cuda compilation tools, release 4.1, V0.2.1221

edited Nov 30 '11 at 06:26

answered Nov 29 '11 at 08:33


I'm not sure what ptx code means. I am using PyCUDA and haven't heard of this concept. Does PyCUDA generate this intermediate stuff the same way that CUDA does? – ely Nov 29 '11 at 17:44
From your error you can see that nvcc is running behind the scenes: `nvcc --cubin -arch sm_11 -I/usr/local/lib/python2.7/dist-packages/pycuda-2011.1.2-py2.7-linux-x86_64.egg/pycuda/../include/pycuda kernel.cu` Skip to actual page 19 (numbered as 17) of [link](http://docs.google.com/viewer?a=v&q=cache:W4YYleKadocJ:sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf+nvccCompilerInfo&hl=en&gl=au&pid=bl&srcid=ADGEESisDxumBUhk8Mows_ZobSzD3Ygia2JlDdxzZ1_fEokIVyUkCT2yLmCe_XkqL_3D3ayFR9oijYAzH1Uul3haqMZYcBklD-FKI8VR4G5NX22k-THgQiQTE0zWf0cVieVVXettyaz-&sig=AHIEtbRPaTLWdIYzWkjQkClImoFxxsgWFA) – Mead Nov 30 '11 at 03:29
From that link you can see an example of the behind-the-scenes compilation; on actual page 21 (numbered as 19) you can see where ptxas comes in, taking a PTX file and generating a cubin. See actual page 22 (numbered as 20) for a diagram of the example compilation. Anyway, you should see that ptxas, which is generating the error, is way down the chain. Generating an error at an earlier stage of the compilation will not help you get the right line number. – Mead Nov 30 '11 at 03:33
It looks to be version 3.2. It's on an academic lab computer, so it's unlikely it can be changed to a newer version in the short term time scale that I am working on. – ely Dec 01 '11 at 00:09
Installed and tried v3.2. Still cannot reproduce the error (within the C++ environment). Are you certain it is -this- and not any other code causing it? If you can control the compilation flags in pyCUDA, maybe you could try the "--keep" flag and then inspect the produced PTX code? (--keep prevents deletion of intermediate files) – CygnusX1 Dec 01 '11 at 23:07
