
I want to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050 Ti (CUDA compute capability 6.1).

In the code below, I have a 3d input array 'data', and I want to compute 1d FFTs along the second dimension of this array. The goal is to reduce the execution time by an order of magnitude.

Using the cufft module from Python's scikit-cuda, I can run a batch of one 1d FFT, and the result matches NumPy's FFT. The problem appears when I move to a real batch size: cufft's output no longer matches NumPy's FFT output (which I take to be the correct one). In the code below, the parameter 'singleFFT' controls whether we schedule a batch of one FFT or many. Help in correcting the FFT output, and in speeding up execution further (if possible), would be greatly appreciated.

import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray


# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
singleFFT = 0
if (1 == singleFFT):
    data_t      = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t      = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft     = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)

data_o_gpu  = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu  = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)

dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
    dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))

The last line of the code compares NumPy's FFT result with cuFFT's. With singleFFT=1 the comparison returns True, while with singleFFT=0 (i.e. a batch of many 1d FFTs) it returns False.

talonmies
Ankit_85
  • [here](https://devtalk.nvidia.com/default/topic/1062896/gpu-accelerated-libraries/multiple-batches-of-1d-fft-using-cufft/post/5383719/#5383719) is an example using cupy. – Robert Crovella Sep 15 '19 at 05:02
  • Welcome to Stack Overflow! Since the DFT is not performed on the last dimension, the 1D arrays on which the DFT is to be applied are likely interleaved. You may try [`cufftPlanMany()`](https://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout) as it supports batched input and strided data layouts. `rank` is going to be 1, `istride` and `ostride` are likely going to be `nTx*nRx`, `idist` and `odist` likely equal 1 and `batch` is going to be `nTx*nRx`. The plan is to be executed `nSamp` times, using `int(data_t_gpu.gpudata)+i*nTx*nRx*nChirp` as input and output. – francis Sep 15 '19 at 12:03
  • Thank you Robert and Francis. I will try out the suggestions listed and get back. Thanks again. – Ankit_85 Sep 19 '19 at 10:16
  • Post my attempts, I'm impelled to conclude that: – Ankit_85 Oct 03 '19 at 08:42
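
As a CPU-side check of the layout discussed in the comments above, the following is a minimal NumPy-only sketch, not a cuFFT call. It assumes small stand-in sizes, a hypothetical helper `strided_fft`, and that a contiguous batched plan reads the flattened buffer as consecutive length-`nChirp` chunks. It emulates both that contiguous reading and the strided reading (stride `nTx*nRx`, distance 1, one batch per `nSamp` slab) with plain index arithmetic, and compares each with `np.fft.fft(..., axis=1)`.

```python
import numpy as np

# Small stand-in sizes (assumptions chosen so the example runs quickly).
nSamp, nChirp, nCh = 4, 8, 6          # nCh plays the role of nTx*nRx
rng = np.random.default_rng(0)
data = (rng.standard_normal((nSamp, nChirp, nCh))
        + 1j * rng.standard_normal((nSamp, nChirp, nCh))).astype(np.complex64)
reference = np.fft.fft(data, axis=1)  # the transform the question wants

flat = data.reshape(-1)               # C-ordered buffer, as it would be copied to the device

# Contiguous reading: every run of nChirp consecutive elements is one signal.
contiguous_view = flat.reshape(-1, nChirp)
wrong = np.fft.fft(contiguous_view, axis=1).reshape(nSamp, nChirp, nCh)

def strided_fft(flat, base, n, stride, dist, batch):
    """Hypothetical helper: element k of signal b sits at
    base + b*dist + k*stride (all offsets counted in elements)."""
    idx = base + dist * np.arange(batch)[:, None] + stride * np.arange(n)[None, :]
    return np.fft.fft(flat[idx], axis=1)   # shape (batch, n)

right = np.empty_like(reference)
for i in range(nSamp):                # one strided batch per nSamp slab
    out = strided_fft(flat, i * nChirp * nCh, nChirp, stride=nCh, dist=1, batch=nCh)
    right[i] = out.T                  # back to (nChirp, nCh)

layout_mismatch = not np.allclose(wrong, reference, atol=1e-3)
strided_matches = np.allclose(right, reference, atol=1e-3)
print(layout_mismatch, strided_matches)
```

The offsets in this sketch are counted in elements. A device pointer obtained with `int(data_t_gpu.gpudata)` is a byte address, so a per-slab offset added to it would presumably need to be scaled by the element size (8 bytes for `complex64`).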

1 Answer


After trying the suggestions above, I would conclude that:

  • Using the cufft library from skcuda is a bit tricky, and getting to the correct FFT output can take a long time in development. I also noticed that there wasn't an order-of-magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda).

  • Using CuPy and arranging the data so that the FFT dimension is laid out contiguously in memory gives an order-of-magnitude improvement in FFT compute time. In my case, the speedup was a little better than 10x.

  • Using CuPy for FFTs is a great option if one wants to stick to Python-only development. The back and forth between C and Python when writing C GPU kernels is an added overhead that CuPy conveniently removes, although CuPy itself still sets up the plan and calls the FFT execution routine internally.
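
The re-layout described in the list above can be sketched as follows. This is a minimal example under stated assumptions, not the code I actually ran: the helper name `fft_along_axis` and the small array sizes are hypothetical, and the array module is passed in as `xp` so the same function runs on NumPy as written, or on CuPy if it is installed and the input is already a CuPy array.

```python
import numpy as np

def fft_along_axis(data, axis, xp=np):
    """Hypothetical helper: FFT along `axis` after making it the last,
    contiguous axis. `xp` is the array module (NumPy by default; passing
    `cupy` is an assumption and needs a working CuPy install)."""
    # Move the FFT axis last and copy so that each signal has unit stride.
    moved = xp.ascontiguousarray(xp.moveaxis(data, axis, -1))
    # Batched 1d FFTs over the last axis.
    spectrum = xp.fft.fft(moved, axis=-1)
    # Restore the original axis order.
    return xp.moveaxis(spectrum, -1, axis)

# Small stand-in data (assumption): shape (nSamp, nChirp, nTx*nRx).
rng = np.random.default_rng(0)
data = (rng.standard_normal((4, 8, 6))
        + 1j * rng.standard_normal((4, 8, 6))).astype(np.complex64)

result = fft_along_axis(data, axis=1)
matches = np.allclose(result, np.fft.fft(data, axis=1), atol=1e-3)
print(matches)
```

With CuPy, the call would presumably look like `fft_along_axis(cupy.asarray(data), 1, xp=cupy)`, followed by `cupy.asnumpy(...)` to bring the result back to the host; whether the extra copy pays off depends on the array sizes and is worth timing.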

Ankit_85