I have written a CUDA program for Synthetic Aperture Radar Image processing. The significant portion of the computation involves finding FFTs and iFFTs and I have used cuFFT library for it. I ran my CUDA code on Jetson TK1 and on a laptop having GT635M (Fermi) and I find it is three times slower on Jetson. It is because FFTs is taking more time and shows lower GFLOPS/s on Jetson. The GFLOPS/s performance of the kernels I wrote are nearly same in both Jetson and Fermi GT635M. It is the FFTs which is slow on Jetson.
The other profiler parameters I observed are:
The Issued Control Flow Instructions, Texture Cache Transactions, Local Memory Store Throughput (bytes/sec), Local Memory Store Transactions Per Request are high on Jetson while the Requested Global Load Throughput(bytes/sec) and Global Load Transactions are high on Fermi GT635M.
Jetson
GPU Clock Rate: 852 Mhz
Mem Clock Rate: 924 Mhz
Fermi GT635M
GPU Clock Rate: 950 Mhz
Mem Clock Rate: 900 Mhz
Both of them have nearly same clock frequencies. Then why is the FFTs taking more time on Jetson and shows poor GFLOPS/s ?
To see the performance of FFTs, I have written a simple CUDA program which does 1D FFT on a matrix of size 2048 * 4912. The data here is contiguous and not strided. The timetaken and GFLOPS/s for them are:
Jetson
3.251 GFLOPS/s Duration: 1.393 sec
Fermi GT635M
47.1 GFLOPS/s Duration: 0.211 sec
#include <stdio.h>
#include <cstdlib>
#include <cufft.h>
#include <stdlib.h>
#include <math.h>
#include "cuda_runtime_api.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"
int main()
{
int numLines = 2048, nValid = 4912;
int iter1, iter2, index=0;
cufftComplex *devData, *hostData;
hostData = (cufftComplex*)malloc(sizeof(cufftComplex) * numLines * nValid);
for(iter1=0; iter1<2048; iter1++)
{
for(iter2=0; iter2<4912; iter2++)
{
index = iter1*4912 + iter2;
hostData[index].x = iter1+1;
hostData[index].y = iter2+1;
}
}
cudaMalloc((void**)&devData, sizeof(cufftComplex) * numLines * nValid);
cudaMemcpy(devData, hostData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyHostToDevice);
// ----------------------------
cufftHandle plan;
cufftPlan1d(&plan, 4912, CUFFT_C2C, 2048);
cufftExecC2C(plan, (cufftComplex *)devData, (cufftComplex *)devData, CUFFT_FORWARD);
cufftDestroy(plan);
// ----------------------------
cudaMemcpy(hostData, devData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyDeviceToHost);
for(iter1=0; iter1<5; iter1++)
{
for(iter2=0; iter2<5; iter2++)
{
index = iter1*4912 + iter2;
printf("%lf+i%lf \n",hostData[index].x, hostData[index].y);
}
printf("\n");
}
cudaDeviceReset();
return 0;
}