I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050Ti with CUDA 6.1.
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda's cufft package to run a batch of 1 1d FFT and the results match with NumPy's FFT. The problem comes when I go to a real batch size. There, I'm not able to match the NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT and also speeding up execution further (if possible) will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
data_t = data[0,:,0]
fftAxis = 0
BATCH = np.int32(1)
else:
data_t = data
fftAxis = 1
BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code matches the result of NumPy's FFT with cuFFT. It could be seen with singleFFT=1, the result is True, while for singleFFT=0 (i.e. batch of many 1d FFTs), the result is False.