
I have written a CUDA program for Synthetic Aperture Radar image processing. A significant portion of the computation involves FFTs and iFFTs, for which I use the cuFFT library. I ran my CUDA code on a Jetson TK1 and on a laptop with a GT635M (Fermi), and it is three times slower on the Jetson. This is because the FFTs take more time and show lower GFLOPS/s on the Jetson. The GFLOPS/s performance of the kernels I wrote myself is nearly the same on the Jetson and the Fermi GT635M; it is only the FFTs that are slow on the Jetson.
The other profiler parameters I observed: Issued Control Flow Instructions, Texture Cache Transactions, Local Memory Store Throughput (bytes/sec), and Local Memory Store Transactions Per Request are high on the Jetson, while Requested Global Load Throughput (bytes/sec) and Global Load Transactions are high on the Fermi GT635M.

Jetson

GPU Clock Rate: 852 MHz

Mem Clock Rate: 924 MHz

Fermi GT635M

GPU Clock Rate: 950 MHz

Mem Clock Rate: 900 MHz

Both of them have nearly the same clock frequencies. Then why do the FFTs take more time on the Jetson and show poor GFLOPS/s?

To see the performance of the FFTs, I wrote a simple CUDA program that does a batched 1D FFT on a matrix of size 2048 x 4912. The data here is contiguous, not strided. The time taken and GFLOPS/s for each are:

Jetson

3.251 GFLOPS/s Duration: 1.393 sec

Fermi GT635M

47.1 GFLOPS/s Duration: 0.211 sec

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cufft.h>
#include "cuda_runtime_api.h"
#include "cuda_profiler_api.h"

int main()
{
    int numLines = 2048, nValid = 4912;

    int iter1, iter2, index = 0;
    cufftComplex *devData, *hostData;

    // Fill the host matrix with a simple synthetic pattern
    hostData = (cufftComplex*)malloc(sizeof(cufftComplex) * numLines * nValid);
    for(iter1 = 0; iter1 < numLines; iter1++)
    {
        for(iter2 = 0; iter2 < nValid; iter2++)
        {
             index = iter1 * nValid + iter2;
             hostData[index].x = iter1 + 1;
             hostData[index].y = iter2 + 1;
        }
    }

    cudaMalloc((void**)&devData, sizeof(cufftComplex) * numLines * nValid);
    cudaMemcpy(devData, hostData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyHostToDevice);

    // ----------------------------
    // Batched 1D C2C FFT: 2048 transforms of length 4912, in place
    cufftHandle plan;
    cufftPlan1d(&plan, nValid, CUFFT_C2C, numLines);

    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);
    cufftDestroy(plan);
    // ----------------------------

    cudaMemcpy(hostData, devData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyDeviceToHost);
    for(iter1 = 0; iter1 < 5; iter1++)
    {
        for(iter2 = 0; iter2 < 5; iter2++)
        {
            index = iter1 * nValid + iter2;
            printf("%f+i%f \n", hostData[index].x, hostData[index].y);
        }
        printf("\n");
    }

    cudaFree(devData);
    free(hostData);
    cudaDeviceReset();
    return 0;
}
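
For completeness, here is a minimal sketch of how the transform itself could be timed with CUDA events and converted to GFLOPS/s using the conventional 5*N*log2(N) flop count per complex transform. This is an illustrative sketch, not necessarily how the figures above were measured:

#include <stdio.h>
#include <math.h>
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 4912, batch = 2048;
    cufftComplex *devData;
    cudaMalloc((void**)&devData, sizeof(cufftComplex) * n * batch);
    cudaMemset(devData, 0, sizeof(cufftComplex) * n * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);

    // Time only the FFT execution, not plan creation or transfers
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Conventional FFT flop estimate: 5*N*log2(N) per complex transform
    double flops = 5.0 * n * log2((double)n) * batch;
    printf("time: %.3f ms, ~%.2f GFLOPS/s\n", ms, flops / (ms * 1e6));

    cufftDestroy(plan);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devData);
    return 0;
}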
Raj
    The memory bandwidth of a [GTX 635M](http://www.geforce.com/hardware/notebook-gpus/geforce-gt-635m/specifications) is listed as 43.2 GB/sec. By comparison, the bandwidth of a Jetson TK1 is around 15 GB/sec (according to various internet sources, I can't find an official NVIDIA spec). Since FFT is generally bandwidth limited, the roughly 3:1 ratio in memory bandwidth is reflected in the roughly 3:1 ratio in FFT performance you observe. This seems mostly like a "general hardware and software question", which are off-topic here, so don't be too surprised if the question gets closed. – njuffa Feb 25 '15 at 18:54
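
To check the bandwidth figures mentioned in the comment above on your own hardware, one rough approach is to time a large device-to-device copy. This is a sketch, not a rigorous benchmark; the buffer size and iteration count are arbitrary choices:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64UL * 1024 * 1024;   // 64 MiB per buffer
    const int reps = 20;
    char *src, *dst;
    cudaMalloc((void**)&src, bytes);
    cudaMalloc((void**)&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; i++)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each copy reads and writes the buffer once, hence the factor of 2
    double gbps = 2.0 * bytes * reps / (ms * 1e-3) / 1e9;
    printf("effective device-to-device bandwidth: %.1f GB/s\n", gbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}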

2 Answers


This could probably be because you are using the LP (low-power) CPU core.

Check out this document on enabling all 4 main ARM cores (the HP cluster) to take advantage of Hyper-Q.

I faced a similar issue. After activating the main HP cluster I get good performance (from 3 GFLOPS (LP) to 160 GFLOPS (HP)).

the swine
  • Thanks Suriya. I checked the configuration. Only one of the cores in cluster G (HP) is active by default. I turned on the other cores as well, but I don't see any improvement in GFLOPS/s. When I set the GPU clock frequency to the maximum 852 MHz from the default 72 MHz, there is an improvement from 2.3 GFLOPS/s to around 24 GFLOPS/s for an FFT of size 1 x 8192 (contiguous memory). When the FFT is done on a large matrix or with strided access, GFLOPS/s is VERY poor (1.5 GFLOPS/s). And, how did you achieve 160 GFLOPS/s from 3 GFLOPS/s by turning on the cores? Can you say something about your application? – Raj Apr 12 '15 at 07:46
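
Regarding the strided-access case mentioned in the comment above: cuFFT's cufftPlanMany interface can transform data laid out with a stride directly, avoiding an explicit transpose. A minimal sketch with illustrative sizes (not taken from the question):

#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 2048;      // FFT length (column length), illustrative
    const int batch = 512;   // number of columns to transform, illustrative
    cufftComplex *d;
    cudaMalloc((void**)&d, sizeof(cufftComplex) * n * batch);
    cudaMemset(d, 0, sizeof(cufftComplex) * n * batch);

    int dims[1] = { n };
    cufftHandle plan;
    // For a row-major n x batch matrix, column j starts at offset j and
    // consecutive elements of a column are 'batch' apart, so istride = batch
    // and idist = 1 (the next column starts one element over).
    cufftPlanMany(&plan, 1, dims,
                  dims, batch, 1,     // inembed, istride, idist
                  dims, batch, 1,     // onembed, ostride, odist
                  CUFFT_C2C, batch);

    cufftExecC2C(plan, d, d, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d);
    return 0;
}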

My blind guess is that, although the TK1 has a more modern core, the memory bandwidth dedicated to the 144 cores of your 635M is significantly higher than what is available to the Tegra.

Furthermore, CUDA is always a bit picky about warp/thread/grid sizes, so it's perfectly possible that the cuFFT algorithms were optimized for the local storage sizes of the Fermi generation and don't work as well on the Keplers.

Marcus Müller
  • Is there anything I can do to improve GFLOPS/s performance for the FFT? I'm using the library and I'm not sure if anything can be done to improve GFLOPS/s. – Raj Feb 25 '15 at 19:00
    As I pointed out in my earlier comment, FFT performance is typically limited by memory bandwidth, not floating-point throughput. There is nothing you can do to improve memory bandwidth, that is limited by hardware. NVIDIA [claims](http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/) improved performance for certain FFT configurations in CUDA 7.0, so you may want to give the CUDA 7.0 release candidate a try. – njuffa Feb 25 '15 at 19:17