I've got a strange problem with CUDA. Consider the snippet below:

#include <stdio.h>

#define OUTPUT_SIZE         26

typedef $PRECISION REAL;

extern "C"    
{
    __global__ void test_coeff ( REAL* results )
    {
        int id      = blockDim.x * blockIdx.x + threadIdx.x;

        int out_index  = OUTPUT_SIZE * id;
        for (int i=0; i<OUTPUT_SIZE; i++)
        {               
            results[out_index+i]=id;
            printf("q");
        }
    }
}

When I compile and run the code (via PyCUDA), it works as expected. When I remove the printf, the results are weird: most of the array is populated correctly, but some of it seems completely random.

Here's the full Python code:

import numpy as np
import string

#pycuda stuff
import pycuda.driver as drv
import pycuda.autoinit

from pycuda.compiler import SourceModule

class MC:

    cudacodetemplate = """
    #include <stdio.h>

    #define OUTPUT_SIZE         26

    typedef $PRECISION REAL;

    extern "C"    
    {
        __global__ void test_coeff ( REAL* results )
        {
            int id      = blockDim.x * blockIdx.x + threadIdx.x;

            int out_index  = OUTPUT_SIZE * id;
            for (int i=0; i<OUTPUT_SIZE; i++)
            {               
                results[out_index+i]=id;
                //printf("q");
            }
        }
    }
    """

    def __init__(self, size, prec = np.float32):
        #800 meg should be enough . . .
        drv.limit.MALLOC_HEAP_SIZE = 1024*1024*800

        self.size       = size
        self.prec       = prec
        template        = string.Template(MC.cudacodetemplate)
        self.cudacode   = template.substitute( PRECISION = 'float' if prec==np.float32 else 'double')

        #self.module     = pycuda.compiler.SourceModule(self.cudacode, no_extern_c=True, options=['--ptxas-options=-v'])
        self.module     = SourceModule(self.cudacode, no_extern_c=True)

    def test(self, out_size):
        #try to precalc the co-efficients for just the elements of the vector that changes
        test  = np.zeros( ( 128, out_size*(2**self.size) ), dtype=self.prec )
        test2 = np.zeros( ( 128, out_size*(2**self.size) ), dtype=self.prec )

        test_coeff =  self.module.get_function ('test_coeff')
        test_coeff( drv.Out(test), block=(2**self.size,1,1), grid=( 128, 1 ) )
        test_coeff( drv.Out(test2), block=(2**self.size,1,1), grid=( 128, 1 ) )
        error = (test-test2)
        return error

if __name__ == '__main__':
    p1  = MC ( 5, np.float64 )
    err = p1.test(26)
    print err.max()
    print err.min()

Basically, with the printf in the kernel, the error is 0. Without it, the script prints some random error (on my machine, around 2452 for the max and -2583 for the min).
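Comparing two GPU runs against each other only shows that they differ, not which one is wrong. Since every thread writes its own id into OUTPUT_SIZE consecutive slots, the expected array can be built on the host with NumPy and each run checked against it. This is only a sketch: the helper names `expected_results` and `count_mismatches` are made up for illustration and are not part of the original script.

```python
import numpy as np

OUTPUT_SIZE = 26  # same value as the #define in the kernel


def expected_results(n_blocks, threads_per_block, dtype=np.float64):
    """Host-side reference: results[OUTPUT_SIZE * id + i] == id for every thread id."""
    n_threads = n_blocks * threads_per_block
    # Each global thread id is repeated OUTPUT_SIZE times in the flat buffer.
    flat = np.repeat(np.arange(n_threads), OUTPUT_SIZE)
    # Same shape as the arrays allocated in test(): one row per block.
    return flat.reshape(n_blocks, OUTPUT_SIZE * threads_per_block).astype(dtype)


def count_mismatches(result, n_blocks, threads_per_block):
    """Number of elements in a GPU result that differ from the reference."""
    reference = expected_results(n_blocks, threads_per_block, result.dtype)
    return int(np.count_nonzero(result != reference))
```

With the launch configuration used above (grid of 128 blocks, 2**5 = 32 threads per block), `count_mismatches(test, 128, 32)` and `count_mismatches(test2, 128, 32)` would show whether one run or both contain bad values.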

I have no idea why.

I'm running CUDA 4.2 with PyCUDA 2012.2 (Windows 7 64-bit) on a GeForce 570.

Thanks.

Agi Hammerthief
user1726633
  • Sorry, but I can't reproduce this using CUDA 4.2 on a 64-bit Linux host and a GTX 670. Both single and double precision versions pass every time I run them using the kernel as you posted it. – talonmies Oct 08 '12 at 07:19
  • I think I have faulty hardware - although I'm not sure why all other cuda programs in the 4.2 GPU SDK work fine. I'll try running this with the same hardware in linux - then I'll try different hardware in windows and see . . . – user1726633 Oct 08 '12 at 07:34
  • I don't know pycuda, but in C/C++ you cannot use `printf` function inside `__global__` or `__device__` code. Is it possible with pycuda? – szamil Oct 26 '12 at 10:37
  • @szamil yes you can, on Fermi and later GPUs, in CUDA C/C++ or pycuda. – harrism Nov 05 '12 at 03:44
  • thanks! On my Quadro Fx 1600M with Compute Capability 1.1 it's impossible. – szamil Nov 05 '12 at 13:21
  • No problem here on a GT 650M with CUDA 5.0. This is probably a bug that has been fixed since, in CUDA and/or the drivers. This is the kind of thing that you can post on NVIDIA forums and possibly their bug report platform. – BenC Apr 22 '13 at 09:09

1 Answer

This is most likely due to compiler optimization. You are setting a block of memory OUTPUT_SIZE elements long to id, which does not change inside the loop. In my experience the compiler will optimize that into a memcpy or something similar unless there is something else going on in the loop, i.e. your print statement. Furthermore, if you do not use that block of memory, the compiler may optimize the entire loop away. Try changing your optimization levels and see whether you get different results.
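One way to try different optimization levels from PyCUDA is through the `options` argument of `SourceModule`, which the question's script already uses in its commented-out line to pass `--ptxas-options=-v`. The sketch below assumes that ptxas accepts `-O0` to `-O3` via that same nvcc flag. The helper names `ptxas_opt_options` and `compile_at_level` are hypothetical, and whether lowering the level changes the result here is untested.

```python
def ptxas_opt_options(level):
    """Return nvcc options that ask ptxas for the given optimization level (0-3)."""
    if level not in (0, 1, 2, 3):
        raise ValueError("ptxas optimization level must be 0, 1, 2 or 3")
    return ['--ptxas-options=-O%d' % level]


def compile_at_level(cuda_source, level):
    """Compile the kernel source with the requested ptxas optimization level."""
    # Imported lazily so ptxas_opt_options() can be used on a machine without a GPU.
    import pycuda.autoinit  # noqa: F401 (creates a CUDA context on import)
    from pycuda.compiler import SourceModule
    return SourceModule(cuda_source, no_extern_c=True,
                        options=ptxas_opt_options(level))
```

In the question's class, `self.module = compile_at_level(self.cudacode, 0)` would replace the existing `SourceModule(...)` call. Comparing the output at level 0 and level 3, without the printf, would show whether the optimizer is involved at all.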

Ethereal