pyculib fft using gpu: speed up

Question

I am a beginner trying to learn how to use a GPU to perform high-speed calculations. I am trying to implement a simple FFT program on the GPU. Below is the program I used to calculate the FFT on the CPU.

from time import time as timer
import numpy as np
import matplotlib.pyplot as plt
winsize=512
shift=16
my_cmap='gray_r'
Fs = 8000
f = 1000
sample =200000
x = np.arange(sample)
y = np.sin(2 * np.pi * f * x / Fs)

data_len=len(y)
window_func=np.blackman(winsize)
fftdata=np.zeros((0,int(winsize/2)))

startime=timer()
for frame in range(0, data_len, shift):
#==============================================================================
#     if frame>0:
#         break
#==============================================================================
    kiri=y[frame:frame+winsize]
    if len(kiri) != winsize:
        break
    windata = window_func * kiri
    fftframe=np.fft.fft(windata,n=winsize)
    magframe=np.abs(fftframe)**2
    powerframe=np.log10(magframe)[:int(winsize/2)].reshape((1,int(winsize/2)))
    fftdata=np.append(fftdata,powerframe,axis=0)
endtime=timer()-startime
fftdata=np.asarray(fftdata)
fftfrq=np.fft.fftfreq(winsize,d=1/Fs)[:int(winsize/2)]
print("CPU runtime:",endtime,"sec")

The figure below shows the output plotted as a spectrogram with the imshow() function:

Timing output is as follows:

 CPU runtime: 65.02100014686584 sec

I then rewrote the above program to use the GPU in my PC, a Quadro K2200, using the pyculib and numba packages offered by Anaconda.

from time import time as timer
import numpy as np
import pyculib.fft
from numba import cuda
import matplotlib.pyplot as plt
winsize=512
shift=16
Fs = 8000
f = 1000
sample =200000
t = np.arange(sample,dtype=np.float64)
y = np.sin(2 * np.pi * f * t / Fs)
my_cmap='gray_r'
data_len=len(y)
window_func=np.blackman(winsize)
fftdata_gpu=np.zeros((0,int(winsize/2)))

startime=timer()
for frame in range(0, data_len, shift):
# =============================================================================
#     if frame>0:
#          break
# =============================================================================
    kiri=y[frame:frame+winsize]
    if len(kiri) != winsize:
        break
    windata = window_func * kiri
    fftframe_gpu = np.zeros(winsize, np.complex128)
    d_xf_gpu = cuda.to_device(fftframe_gpu)
    pyculib.fft.fft(windata.astype(np.complex128),d_xf_gpu)
    d_xf_gpu.copy_to_host(fftframe_gpu)
    magframe_gpu=np.abs(fftframe_gpu)**2
    powerframe_gpu=np.log10(magframe_gpu)[:int(winsize/2)].reshape((1,int(winsize/2)))
    fftdata_gpu=np.append(fftdata_gpu,powerframe_gpu,axis=0)
endtime=timer()-startime
fftdata_gpu=np.asarray(fftdata_gpu)
print("GPU runtime:",endtime,"sec")

The timing output shows that the GPU implementation actually took about 28 seconds longer.

GPU runtime: 92.87200021743774 sec

I am guessing this is because I repeatedly copy the arrays onto the device and back for every frame. Is there a better way to implement this? I would really appreciate any opinions on what I am doing wrong here.
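One way to probe that guess without touching the GPU library is to remove the per-frame work altogether: build all overlapping frames as one 2-D array and issue a single FFT call over the batch. The following is a minimal CPU-only sketch, assuming NumPy 1.20 or later (for sliding_window_view); the function name batched_power_spectrum is illustrative and not part of the original program. The same 2-D array could in principle be moved to the device in one transfer, but whether pyculib accepts it in this form is an assumption that is not verified here.

```python
import numpy as np

def batched_power_spectrum(y, winsize=512, shift=16):
    """Return log10 power for every full frame of y, one row per frame."""
    window = np.blackman(winsize)
    # Read-only view of every length-winsize window; keep every shift-th one.
    frames = np.lib.stride_tricks.sliding_window_view(y, winsize)[::shift]
    # A single FFT call over the whole batch instead of one call per frame.
    spectrum = np.fft.fft(frames * window, axis=1)[:, : winsize // 2]
    return np.log10(np.abs(spectrum) ** 2)

# Small usage example with the same 1 kHz tone sampled at 8 kHz.
Fs, f = 8000, 1000
y = np.sin(2 * np.pi * f * np.arange(4096) / Fs)
fftdata = batched_power_spectrum(y)
print(fftdata.shape)
```

Each row corresponds to one frame of the original loop, so the result can be passed to imshow() in the same way as fftdata.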

Below I paste the output of the GPU implementation.

Edit: Adding the profiling results

For CPU code: the first 20 function calls sorted based on cumulative time

   569255 function calls (563691 primitive calls) in 64.792 seconds

   Ordered by: cumulative time
   List reduced from 2594 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    296/1    0.009    0.000   64.793   64.793 {built-in method builtins.exec}
        1   11.489   11.489   64.793   64.793 cuda_fft_tr1_cpu.py:6(<module>)
    12469    0.037    0.000   52.145    0.004 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\lib\function_base.py:5095(append)
    12469   52.093    0.004   52.093    0.004 {built-in method numpy.core.multiarray.concatenate}
    12469    0.073    0.000    0.622    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\fft\fftpack.py:102(fft)
    333/2    0.002    0.000    0.493    0.247 <frozen importlib._bootstrap>:966(_find_and_load)
    333/2    0.001    0.000    0.493    0.246 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
    323/3    0.002    0.000    0.491    0.164 <frozen importlib._bootstrap>:651(_load_unlocked)
    272/3    0.001    0.000    0.491    0.164 <frozen importlib._bootstrap_external>:672(exec_module)
    432/3    0.000    0.000    0.490    0.163 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
    12469    0.073    0.000    0.415    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\fft\fftpack.py:47(_raw_fft)
   356/24    0.000    0.000    0.336    0.014 {built-in method builtins.__import__}
        1    0.000    0.000    0.232    0.232 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\matplotlib\pyplot.py:17(<module>)
 1445/628    0.001    0.000    0.220    0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
        1    0.000    0.000    0.141    0.141 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\__init__.py:106(<module>)
    12469    0.139    0.000    0.139    0.000 {built-in method numpy.fft.fftpack_lite.cfftf}
      328    0.003    0.000    0.122    0.000 <frozen importlib._bootstrap>:870(_find_spec)
      310    0.000    0.000    0.117    0.000 <frozen importlib._bootstrap_external>:1149(find_spec)
      310    0.002    0.000    0.117    0.000 <frozen importlib._bootstrap_external>:1117(_get_spec)
        1    0.000    0.000    0.115    0.115 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\matplotlib\__init__.py:101(<module>)
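In the listing above, numpy.core.multiarray.concatenate (reached through np.append) accounts for 52.093 s of the 64.792 s total, while numpy's fft accounts for 0.622 s. np.append copies the whole accumulated array on every call, so a common alternative is to allocate the result once and fill it row by row. A minimal sketch, assuming the same framing as the CPU program; the function name spectrogram_preallocated is illustrative:

```python
import numpy as np

def spectrogram_preallocated(y, winsize=512, shift=16):
    """Per-frame FFT loop that writes into a preallocated result array."""
    window = np.blackman(winsize)
    half = winsize // 2
    # Number of full frames, matching the `break` on short frames in the loop.
    n_frames = max(0, (len(y) - winsize) // shift + 1)
    out = np.empty((n_frames, half))
    for i in range(n_frames):
        start = i * shift
        frame = window * y[start:start + winsize]
        power = np.abs(np.fft.fft(frame, n=winsize)) ** 2
        out[i] = np.log10(power[:half])  # fill in place, no reallocation
    return out

# Small usage example with the same 1 kHz tone sampled at 8 kHz.
y_demo = np.sin(2 * np.pi * 1000 * np.arange(4096) / 8000)
spec = spectrogram_preallocated(y_demo)
print(spec.shape)
```

For the 200000-sample signal in the question this formula gives 12469 frames, which matches the ncalls column in the profile.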

For GPU code: sorting based on cumulative time

   5689881 function calls (5642977 primitive calls) in 94.179 seconds

   Ordered by: cumulative time
   List reduced from 4373 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    590/1    0.019    0.000   94.207   94.207 {built-in method builtins.exec}
        1   12.080   12.080   94.207   94.207 cuda_fft_tr1_gpu.py:6(<module>)
    12469    0.046    0.000   51.752    0.004 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\lib\function_base.py:5095(append)
    12469   51.679    0.004   51.679    0.004 {built-in method numpy.core.multiarray.concatenate}
62345/49876    0.131    0.000   22.880    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devices.py:209(_require_cuda_context)
    12469    0.111    0.000   20.875    0.002 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:190(fft)
    49876   17.129    0.000   17.171    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\utils\libutils.py:40(wrapped)
    12469    0.176    0.000   15.354    0.001 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:38(__init__)
    12469    0.284    0.000   15.046    0.001 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\binding.py:207(many)
    37407    0.151    0.000    7.634    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:451(auto_device)
    24938    0.108    0.000    4.707    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:422(from_array_like)
    12469    0.043    0.000    4.644    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:134(forward)
    24938    0.356    0.000    4.599    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:58(__init__)
    87290    4.156    0.000    4.398    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:284(safe_cuda_api_call)
    12469    0.035    0.000    4.292    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\api.py:26(to_device)
    12469    0.064    0.000    3.468    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:86(_prepare)
    24938    0.027    0.000    3.404    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\api.py:275(_auto_device)
    24938    0.114    0.000    2.452    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:139(copy_to_device)
    24938    0.080    0.000    2.157    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:1573(host_to_device)
    24938    0.261    0.000    1.848    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:667(memalloc)

pyculib's fft function takes approximately 20 sec in total, while numpy's fft takes ~0.6 sec. Why is the pyculib function taking so long? Is there any way to improve the code so as to shorten this time? Or is it better to use a different library?
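For scale, the cumulative times in the two listings can be divided by the 12469 calls each per-frame function received. A small sketch that uses only numbers copied from the profiles above; the dictionary keys are labels chosen here for readability:

```python
# Cumulative seconds copied from the two cProfile listings above.
CALLS = 12469  # ncalls reported for each per-frame function

cumtime_s = {
    "numpy fft (fftpack.py:102)": 0.622,
    "pyculib fft (api.py:190)": 20.875,
    "pyculib api.py:38 __init__": 15.354,
    "numpy concatenate, CPU run": 52.093,
    "numpy concatenate, GPU run": 51.679,
}

# Convert each cumulative time to milliseconds per call.
per_call_ms = {name: 1000.0 * t / CALLS for name, t in cumtime_s.items()}

for name, ms in per_call_ms.items():
    print(f"{name:30s} {ms:7.3f} ms per call")
```

On these figures pyculib's fft costs roughly 1.7 ms per call against roughly 0.05 ms for numpy's, with most of that time apparently spent in api.py:38 (__init__), and the concatenation behind np.append costs more per call than either FFT in both runs.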

You seem to be assuming that the FFT is the performance bottleneck in your Python code. But is that really the case? Have you profiled it? - talonmies, Oct 20 '17 at 14:25
No, I haven't profiled it. I came to that conclusion because the timing covers only the for loop where I calculate the FFT, so the increase of about 30 sec in runtime comes from that loop alone. The conclusion seemed reasonable. What else might be the bottleneck? - Kanmani, Oct 25 '17 at 23:57
Also, I am not sure this code is actually using the GPU for the calculation. While the code was running I checked the GPU usage with the MSI Afterburner application, and the percentage barely went above 2%. Is this calculation too lightweight to really use the GPU resources? - Kanmani, Oct 26 '17 at 00:24

