I am using a cuFFT store callback in a complex-to-complex, out-of-place, 1D, batched FFT (i.e. I am doing many 1D FFTs of the same size). From Section 2.9.4 of the documentation, I expect this callback to be called exactly once for each output sample. See in particular this quote, taken verbatim from that section:
cuFFT will call the load callback routine, for each point in the input, once and only once. Similarly it will call the store callback routine, for each point in the output, once and only once.
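For reference, here is the store-callback type that my routine below matches; this is the cufftCallbackStoreC typedef as I read it from cufftXt.h and the cuFFT documentation (parameter names per the docs):
typedef void (*cufftCallbackStoreC)(void *dataOut, size_t offset,
                                    cufftComplex element, void *callerInfo,
                                    void *sharedPointer);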
Nevertheless, I seem to have an example that contradicts this. In the code below, I expect to see each of the numbers 0-19 printed exactly once, corresponding to the store callback being called exactly once for each of the first 20 output samples. However, when I do 504 1D FFTs of size 32, the store callback gets called twice for each output!
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>

// Very simple store callback: prints the index and does the store
static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {
    // Print the index. Each index should appear exactly once.
    if (index < 20) printf("%8llu\n", (unsigned long long)index);

    // Do the store
    ((cufftComplex *)a)[index] = z;
}

__device__ cufftCallbackStoreC stor_cb_ptr_d = stor_cb;

int main() {
    size_t work_size;

    // With these parameters, the store callback is
    // called twice for each output
    int fft_sz = 32;    // Size of each FFT
    int num_ffts = 504; // How many FFTs to do

    // With these parameters, the store callback is
    // called once for each output
    // int fft_sz = 1024; // Size of each FFT
    // int num_ffts = 20; // How many FFTs to do

    // Buffers
    cufftComplex *in_buf_h, *in_buf_d, *out_buf_d;

    // Allocate buffers on host and device
    in_buf_h = new cufftComplex[fft_sz*num_ffts];
    cudaMalloc(&in_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));

    // Fill input buffer with zeros and copy to device
    memset(in_buf_h, 0, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMemcpy(in_buf_d, in_buf_h, fft_sz*num_ffts*sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    // Plan num_ffts FFTs of size fft_sz
    cufftHandle plan;
    cufftCreate(&plan);
    cufftMakePlan1d(plan, fft_sz, CUFFT_C2C, num_ffts, &work_size);

    // Associate the store callback with the plan
    cufftCallbackStoreC stor_cb_ptr_h;
    cudaMemcpyFromSymbol(&stor_cb_ptr_h, stor_cb_ptr_d, sizeof(stor_cb_ptr_h));
    cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h, CUFFT_CB_ST_COMPLEX, 0);

    // Execute the plan. We don't actually care about the values. The idea
    // is that the store callback should be called exactly once for
    // each of the fft_sz*num_ffts samples.
    cufftExecC2C(plan, in_buf_d, out_buf_d, CUFFT_FORWARD);

    // Sync the device to flush the printf output
    cudaDeviceSynchronize();

    return 0;
}
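(A note on the repro: it deliberately omits error checking to stay short. A minimal sketch of the kind of check that could wrap each call; CHECK_CUFFT is a hypothetical helper of mine, not part of the cuFFT API:)
// Hypothetical helper: abort if a cuFFT call reports failure
#define CHECK_CUFFT(call)                                    \
    do {                                                     \
        cufftResult s = (call);                              \
        if (s != CUFFT_SUCCESS) {                            \
            fprintf(stderr, "cuFFT error %d at line %d\n",   \
                    (int)s, __LINE__);                       \
            exit(EXIT_FAILURE);                              \
        }                                                    \
    } while (0)

// e.g. CHECK_CUFFT(cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h,
//                                     CUFFT_CB_ST_COMPLEX, 0));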
Example output for fft_sz=32, num_ffts=504:
$ stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
In contrast, if I do 20 FFTs of size 1024, then I get the expected behavior: the store callback is called exactly once for each output. Example output for fft_sz=1024, num_ffts=20:
$ stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
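To rule out device-side printf quirks (e.g., the printf buffer filling up or output interleaving), the invocation count can also be checked without printing at all. Here is a minimal sketch, not tested against the exact plan above, that counts visits to each output index via the callerInfo mechanism; counts_d is a hypothetical zero-initialized device array of fft_sz*num_ffts unsigned ints:
// Hypothetical store callback that counts invocations per output index
static __device__ void stor_cb_count(void *a, size_t index, cufftComplex z,
                                     void *cb_info, void *sharedmem) {
    unsigned int *counts = (unsigned int *)cb_info; // one slot per output sample
    atomicAdd(&counts[index], 1u);  // should end at exactly 1 everywhere
    ((cufftComplex *)a)[index] = z; // still perform the store
}

// Host side: pass the counts array through the callerInfo parameter
//   unsigned int *counts_d;
//   cudaMalloc(&counts_d, fft_sz*num_ffts*sizeof(unsigned int));
//   cudaMemset(counts_d, 0, fft_sz*num_ffts*sizeof(unsigned int));
//   cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h,
//                      CUFFT_CB_ST_COMPLEX, (void **)&counts_d);
Copying counts_d back to the host after cudaDeviceSynchronize() should show 1 in every slot if the once-and-only-once contract holds; based on the printout above, I would expect it to show 2 for fft_sz=32, num_ffts=504.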
Am I misunderstanding something, do I have a bug, or is this a problem with cuFFT?
I am running this on Linux Mint, with CUDA V8.0.61 and g++ 5.4.0, on a GeForce GTX 1080:
$ uname -a
Linux orpheus 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8114 MBytes (8507752448 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1848 MHz (1.85 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
Here are my compile and link commands:
$ nvcc -ccbin g++ -dc -m64 -o stor_cb_tst.o -c stor_cb_tst.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ nvcc -ccbin g++ -m64 -o stor_cb_tst stor_cb_tst.o -lcufft_static -lculibos
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ ./stor_cb_tst
(output identical to the fft_sz=32, num_ffts=504 run shown above: each of 0-19 printed twice)