I am a beginner trying to learn how to use a GPU to perform high-speed calculations. I am trying to implement a simple FFT program on the GPU. Below is the program I used to calculate the FFT on the CPU.
from time import time as timer
import numpy as np
import matplotlib.pyplot as plt
winsize=512
shift=16
my_cmap='gray_r'
Fs = 8000
f = 1000
sample =200000
x = np.arange(sample)
y = np.sin(2 * np.pi * f * x / Fs)
data_len=len(y)
window_func=np.blackman(winsize)
fftdata=np.zeros((0,int(winsize/2)))
startime=timer()
for frame in range(0, data_len, shift):
#==============================================================================
#     if frame>0:
#         break
#==============================================================================
    # Slice one analysis window; stop at the first incomplete window
    kiri=y[frame:frame+winsize]
    if len(kiri) != winsize:
        break
    windata = window_func * kiri
    fftframe=np.fft.fft(windata,n=winsize)
    magframe=np.abs(fftframe)**2
    # Keep only the first half of the spectrum (non-negative frequencies)
    powerframe=np.log10(magframe)[:int(winsize/2)].reshape((1,int(winsize/2)))
    fftdata=np.append(fftdata,powerframe,axis=0)
endtime=timer()-startime
fftdata=np.asarray(fftdata)
fftfrq=np.fft.fftfreq(winsize,d=1/Fs)[:int(winsize/2)]
print("CPU runtime:",endtime,"sec")
The figure below shows the output as a spectrogram, plotted using the imshow() function:
Timing output is as follows:
CPU runtime: 65.02100014686584 sec
I then rewrote the above program to use the GPU of my PC, a Quadro K2200, using the pyculib and numba packages offered by Anaconda.
from time import time as timer
import numpy as np
import pyculib.fft
from numba import cuda
import matplotlib.pyplot as plt
winsize=512
shift=16
Fs = 8000
f = 1000
sample =200000
t = np.arange(sample,dtype=np.float64)
y = np.sin(2 * np.pi * f * t / Fs)
my_cmap='gray_r'
data_len=len(y)
window_func=np.blackman(winsize)
fftdata_gpu=np.zeros((0,int(winsize/2)))
startime=timer()
for frame in range(0, data_len, shift):
# =============================================================================
#     if frame>0:
#         break
# =============================================================================
    kiri=y[frame:frame+winsize]
    if len(kiri) != winsize:
        break
    windata = window_func * kiri
    fftframe_gpu = np.zeros(winsize, np.complex128)
    d_xf_gpu = cuda.to_device(fftframe_gpu)
    pyculib.fft.fft(windata.astype(np.complex128),d_xf_gpu)
    d_xf_gpu.copy_to_host(fftframe_gpu)
    magframe_gpu=np.abs(fftframe_gpu)**2
    powerframe_gpu=np.log10(magframe_gpu)[:int(winsize/2)].reshape((1,int(winsize/2)))
    fftdata_gpu=np.append(fftdata_gpu,powerframe_gpu,axis=0)
endtime=timer()-startime
fftdata_gpu=np.asarray(fftdata_gpu)
print("GPU runtime:",endtime,"sec")
The timing output when I run the above program shows that the GPU implementation actually took about 28 seconds longer.
GPU runtime: 92.87200021743774 sec
I am guessing this is because I repeatedly copy the arrays to the device and back for every frame. Is there a better way to implement this? I would really like to know what I am doing wrong here.
Below is the output of the GPU implementation.
Edit: Adding the profiling results
For the CPU code, the first 20 function calls sorted by cumulative time:
569255 function calls (563691 primitive calls) in 64.792 seconds
Ordered by: cumulative time
List reduced from 2594 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
296/1 0.009 0.000 64.793 64.793 {built-in method builtins.exec}
1 11.489 11.489 64.793 64.793 cuda_fft_tr1_cpu.py:6(<module>)
12469 0.037 0.000 52.145 0.004 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\lib\function_base.py:5095(append)
12469 52.093 0.004 52.093 0.004 {built-in method numpy.core.multiarray.concatenate}
12469 0.073 0.000 0.622 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\fft\fftpack.py:102(fft)
333/2 0.002 0.000 0.493 0.247 <frozen importlib._bootstrap>:966(_find_and_load)
333/2 0.001 0.000 0.493 0.246 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
323/3 0.002 0.000 0.491 0.164 <frozen importlib._bootstrap>:651(_load_unlocked)
272/3 0.001 0.000 0.491 0.164 <frozen importlib._bootstrap_external>:672(exec_module)
432/3 0.000 0.000 0.490 0.163 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
12469 0.073 0.000 0.415 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\fft\fftpack.py:47(_raw_fft)
356/24 0.000 0.000 0.336 0.014 {built-in method builtins.__import__}
1 0.000 0.000 0.232 0.232 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\matplotlib\pyplot.py:17(<module>)
1445/628 0.001 0.000 0.220 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
1 0.000 0.000 0.141 0.141 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\__init__.py:106(<module>)
12469 0.139 0.000 0.139 0.000 {built-in method numpy.fft.fftpack_lite.cfftf}
328 0.003 0.000 0.122 0.000 <frozen importlib._bootstrap>:870(_find_spec)
310 0.000 0.000 0.117 0.000 <frozen importlib._bootstrap_external>:1149(find_spec)
310 0.002 0.000 0.117 0.000 <frozen importlib._bootstrap_external>:1117(_get_spec)
1 0.000 0.000 0.115 0.115 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\matplotlib\__init__.py:101(<module>)
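In the CPU profile above, numpy's concatenate (called through np.append) accounts for 52.093 s of the 64.792 s total, while np.fft.fft accounts for only 0.622 s cumulative. np.append copies the whole accumulated array on every call, so one thing that could be tried is allocating the output once and writing each row in place. The following is a minimal sketch under that assumption, not a measured result; the function name spectrogram_prealloc is hypothetical and does not appear in the original program.

```python
import numpy as np

def spectrogram_prealloc(y, winsize=512, shift=16):
    """Log-power spectrogram that fills a preallocated array (hypothetical helper)."""
    window_func = np.blackman(winsize)
    half = winsize // 2
    # Number of complete windows, matching the loop that breaks on a short slice
    n_frames = max(0, (len(y) - winsize) // shift + 1)
    out = np.empty((n_frames, half))
    for i in range(n_frames):
        start = i * shift
        windata = window_func * y[start:start + winsize]
        power = np.abs(np.fft.fft(windata, n=winsize)) ** 2
        # Write the row in place instead of growing the array with np.append
        out[i] = np.log10(power[:half])
    return out
```

Called as spectrogram_prealloc(y) with the y, winsize and shift defined in the program, this should return the same array as fftdata, without the repeated copies.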
For the GPU code, the first 20 function calls sorted by cumulative time:
5689881 function calls (5642977 primitive calls) in 94.179 seconds
Ordered by: cumulative time
List reduced from 4373 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
590/1 0.019 0.000 94.207 94.207 {built-in method builtins.exec}
1 12.080 12.080 94.207 94.207 cuda_fft_tr1_gpu.py:6(<module>)
12469 0.046 0.000 51.752 0.004 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\lib\function_base.py:5095(append)
12469 51.679 0.004 51.679 0.004 {built-in method numpy.core.multiarray.concatenate}
62345/49876 0.131 0.000 22.880 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devices.py:209(_require_cuda_context)
12469 0.111 0.000 20.875 0.002 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:190(fft)
49876 17.129 0.000 17.171 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\utils\libutils.py:40(wrapped)
12469 0.176 0.000 15.354 0.001 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:38(__init__)
12469 0.284 0.000 15.046 0.001 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\binding.py:207(many)
37407 0.151 0.000 7.634 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:451(auto_device)
24938 0.108 0.000 4.707 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:422(from_array_like)
12469 0.043 0.000 4.644 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:134(forward)
24938 0.356 0.000 4.599 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:58(__init__)
87290 4.156 0.000 4.398 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:284(safe_cuda_api_call)
12469 0.035 0.000 4.292 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\api.py:26(to_device)
12469 0.064 0.000 3.468 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:86(_prepare)
24938 0.027 0.000 3.404 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\api.py:275(_auto_device)
24938 0.114 0.000 2.452 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:139(copy_to_device)
24938 0.080 0.000 2.157 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:1573(host_to_device)
24938 0.261 0.000 1.848 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:667(memalloc)
pyculib's fft function takes approximately 20 sec, while numpy's fft takes ~0.6 sec. Why is pyculib's function taking so long? Is there any way to improve the code to shorten this time? Or is it better to use a different library?
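Both profiles show 12469 separate FFT calls, one per 512-sample frame, and in the GPU profile much of the time sits in per-call setup and transfers (api.py __init__, to_device, memalloc) rather than in the transform itself. One way to examine this is to gather all frames into a single 2-D array and transform them in one call. The sketch below does this on the CPU with numpy only. The function name spectrogram_batched is hypothetical, and whether handing the same (n_frames, winsize) array to a batched GPU FFT would be faster on this card is an assumption that has not been tested here.

```python
import numpy as np

def spectrogram_batched(y, winsize=512, shift=16):
    """Log-power spectrogram using one batched FFT call (hypothetical helper)."""
    window_func = np.blackman(winsize)
    half = winsize // 2
    # Number of complete windows, as in the frame-by-frame loop
    n_frames = max(0, (len(y) - winsize) // shift + 1)
    if n_frames == 0:
        return np.empty((0, half))
    # Row i of idx holds the sample indices of frame i
    starts = np.arange(n_frames) * shift
    idx = starts[:, None] + np.arange(winsize)[None, :]
    # Fancy indexing copies the overlapping frames into shape (n_frames, winsize);
    # for 12469 frames of 512 float64 samples this is about 51 MB
    frames = y[idx] * window_func
    # One FFT call over all rows instead of one call per frame
    spectra = np.fft.fft(frames, n=winsize, axis=1)
    return np.log10(np.abs(spectra[:, :half]) ** 2)
```

The design trades memory for fewer calls: the overlapping windows are materialized up front, so the per-frame Python overhead and any per-call setup are paid once rather than 12469 times.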