2

The racecheck tool reported memory races with my application. I've isolated it to the CUFFT exec functions.

Am I doing something wrong? If not, how can I make racecheck ignore this?

Here is a minimal example that when run in cuda-memcheck --tool racecheck produces a bunch of 'hazards' like

========= Race reported between Write access at 0x00000a30 in void spVector0128C::kernelTex<unsigned int, float, fftDirection_t=-1, unsigned int=8, unsigned int=8, LUT, ALL, WRITEBACK>(kernel_parameters_t<fft_tex_t, unsigned int, float>)
=========     and Read access at 0x00000a70 in void spVector0128C::kernelTex<unsigned int, float, fftDirection_t=-1, unsigned int=8, unsigned int=8, LUT, ALL, WRITEBACK>(kernel_parameters_t<fft_tex_t, unsigned int, float>) [4 hazards]

The example

#include <cufft.h>
#include <iostream>

#define ck(cmd) if ( cmd) { std::cerr << "error at line " << __LINE__ << std::endl;exit(1);}

int main(int argc,char ** argv)
{
    int nfft=128;
    cufftComplex * ibuf;
    cufftComplex * obuf;
    ck( cudaMalloc((void**)&ibuf, sizeof(cufftComplex)*nfft) );
    ck( cudaMalloc((void**)&obuf, sizeof(cufftComplex)*nfft) );
    ck( cudaMemset( ibuf,0,sizeof(cufftComplex)*nfft) );

    cufftHandle fft;
    ck( cufftPlanMany(&fft,1,&nfft,
                NULL,1,nfft,
                NULL,1,nfft,
                CUFFT_C2C,1) );

    ck( cufftExecC2C(fft,ibuf,obuf,CUFFT_FORWARD) );

    ck( cudaDeviceSynchronize() );
    cufftDestroy( fft );
    ck(cudaFree(ibuf));
    ck(cudaFree(obuf));
    return 0;
}
Mark Borgerding
  • 8,117
  • 4
  • 30
  • 51

1 Answers1

1

You are not doing anything wrong. I don't think it can be disabled similar to nvprof - cudaProfilerStart/cudaProfilerStop

Please observe slight differences between descriptions of __syncthreads and BAR.SYNC instruction:

__syncthreads - http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

waits until all threads in the thread block have reached this point

BAR.SYNC - http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-instructions

Barriers are executed on a per-warp basis as if all the threads in a warp are active."

This is not exactly the same behavior. cuda-memcheck racecheck might follow __syncthreads definition and cuFFT kernels BAR.SYNC one

This is most likely going to be fixed in next release.

llukas
  • 359
  • 1
  • 4
  • `__syncthreads()` compiles to `bar.sync`, so their effect is identical. The difference is only in documentation, with the description of `__syncthreads()` being simplified to leave out the behavior inside conditional code. – tera Oct 16 '16 at 09:13
  • "This is most likely going to be fixed in next release." - Do you have any inside information? This discrepancy in level of documentation detail has been there for 7 or so major CUDA releases (basically ever since PTX became officially documented), so just observing past behavior it doesn't look to me like Nvidia intends to change that anytime soon. – tera Oct 16 '16 at 09:19
  • Let me point you to publicly available info that things are likely going to change in the future. Please have a look at last part of the presentation is devoted to "Cooperative Groups": http://on-demand.gputechconf.com/gtc/2016/presentation/s6224-mark-harris.pdf – llukas Oct 16 '16 at 21:12