
I have written a CUDA program for Synthetic Aperture Radar image processing. A significant portion of the computation involves FFTs and iFFTs, for which I use the cuFFT library. I ran my CUDA code on a Jetson TK1 and on a laptop with a GT635M (Fermi), and it is three times slower on the Jetson. This is because the FFTs take more time and show lower GFLOPS/s on the Jetson. The GFLOPS/s performance of the kernels I wrote myself is nearly the same on the Jetson and the Fermi GT635M; it is only the FFTs that are slow on the Jetson.
The other profiler parameters I observed: Issued Control Flow Instructions, Texture Cache Transactions, Local Memory Store Throughput (bytes/sec), and Local Memory Store Transactions Per Request are high on the Jetson, while Requested Global Load Throughput (bytes/sec) and Global Load Transactions are high on the Fermi GT635M.

Jetson

GPU Clock Rate: 852 MHz

Mem Clock Rate: 924 MHz

Fermi GT635M

GPU Clock Rate: 950 MHz

Mem Clock Rate: 900 MHz

Both of them have nearly the same clock frequencies. Then why do the FFTs take more time on the Jetson and show poor GFLOPS/s?

To see the performance of the FFTs, I wrote a simple CUDA program that does a batched 1D FFT on a matrix of size 2048 x 4912. The data here is contiguous, not strided. The time taken and GFLOPS/s for each are:

Jetson

3.251 GFLOPS/s Duration: 1.393 sec

Fermi GT635M

47.1 GFLOPS/s Duration: 0.211 sec

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cufft.h>
#include "cuda_runtime_api.h"
#include "cuda_profiler_api.h"

int main()
{
    int numLines = 2048, nValid = 4912;

    int iter1, iter2, index = 0;
    cufftComplex *devData, *hostData;

    // Fill the host matrix with a simple synthetic pattern
    hostData = (cufftComplex*)malloc(sizeof(cufftComplex) * numLines * nValid);
    for(iter1 = 0; iter1 < numLines; iter1++)
    {
        for(iter2 = 0; iter2 < nValid; iter2++)
        {
             index = iter1 * nValid + iter2;
             hostData[index].x = iter1 + 1;
             hostData[index].y = iter2 + 1;
        }
    }

    cudaMalloc((void**)&devData, sizeof(cufftComplex) * numLines * nValid);
    cudaMemcpy(devData, hostData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyHostToDevice);

    // ----------------------------
    // Batched 1D C2C FFT: 2048 transforms of length 4912, in place
    cufftHandle plan;
    cufftPlan1d(&plan, nValid, CUFFT_C2C, numLines);

    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);
    cufftDestroy(plan);
    // ----------------------------

    cudaMemcpy(hostData, devData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyDeviceToHost);
    for(iter1 = 0; iter1 < 5; iter1++)
    {
        for(iter2 = 0; iter2 < 5; iter2++)
        {
            index = iter1 * nValid + iter2;
            printf("%f+i%f \n", hostData[index].x, hostData[index].y);
        }
        printf("\n");
    }

    cudaFree(devData);
    free(hostData);
    cudaDeviceReset();
    return 0;
}
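
For completeness, here is a minimal sketch of how the transform itself could be timed with CUDA events and converted to GFLOPS/s using the conventional 5*N*log2(N) flop count per complex transform. This is an illustrative sketch, not necessarily how the figures above were measured:

#include <stdio.h>
#include <math.h>
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 4912, batch = 2048;
    cufftComplex *devData;
    cudaMalloc((void**)&devData, sizeof(cufftComplex) * n * batch);
    cudaMemset(devData, 0, sizeof(cufftComplex) * n * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);

    // Time only the FFT execution, not plan creation or transfers
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Conventional FFT flop estimate: 5*N*log2(N) per complex transform
    double flops = 5.0 * n * log2((double)n) * batch;
    printf("time: %.3f ms, ~%.2f GFLOPS/s\n", ms, flops / (ms * 1e6));

    cufftDestroy(plan);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devData);
    return 0;
}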
Raj
    The memory bandwidth of a [GTX 635M](http://www.geforce.com/hardware/notebook-gpus/geforce-gt-635m/specifications) is listed as 43.2 GB/sec. By comparison, the bandwidth of a Jetson TK1 is around 15 GB/sec (according to various internet sources, I can't find an official NVIDIA spec). Since FFT is generally bandwidth limited, the roughly 3:1 ratio in memory bandwidth is reflected in the roughly 3:1 ratio in FFT performance you observe. This seems mostly like a "general hardware and software question", which are off-topic here, so don't be too surprised if the question gets closed. – njuffa Feb 25 '15 at 18:54
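
To check the bandwidth figures mentioned in the comment above on your own hardware, one rough approach is to time a large device-to-device copy. This is a sketch, not a rigorous benchmark; the buffer size and iteration count are arbitrary choices:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64UL * 1024 * 1024;   // 64 MiB per buffer
    const int reps = 20;
    char *src, *dst;
    cudaMalloc((void**)&src, bytes);
    cudaMalloc((void**)&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; i++)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each copy reads and writes the buffer once, hence the factor of 2
    double gbps = 2.0 * bytes * reps / (ms * 1e-3) / 1e9;
    printf("effective device-to-device bandwidth: %.1f GB/s\n", gbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}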

2 Answers


This could probably be because you are using the LP (low-power) CPU core.

Check out this document on enabling all 4 main ARM cores (the HP cluster) to take advantage of Hyper-Q.

I faced a similar issue. After activating the main HP cluster I get good performance (from 3 GFLOPS (LP) to 160 GFLOPS (HP)).

the swine
  • Thanks Suriya. I checked the configuration. Only one of the cores in cluster G (HP) is active by default. I turned on the other cores as well, but I don't see any improvement in GFLOPS/s. When I set the GPU clock frequency to the maximum 852 MHz from the default 72 MHz, there is an improvement from 2.3 GFLOPS/s to around 24 GFLOPS/s for an FFT of size 1 x 8192 (contiguous memory). When the FFT is done on a large matrix or with strided access, GFLOPS/s is VERY poor (1.5 GFLOPS/s). And, how did you achieve 160 GFLOPS/s from 3 GFLOPS/s by turning on the cores? Can you say something about your application? – Raj Apr 12 '15 at 07:46
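
Regarding the strided-access case mentioned in the comment above: cuFFT's cufftPlanMany interface can transform data laid out with a stride directly, avoiding an explicit transpose. A minimal sketch with illustrative sizes (not taken from the question):

#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 2048;      // FFT length (column length), illustrative
    const int batch = 512;   // number of columns to transform, illustrative
    cufftComplex *d;
    cudaMalloc((void**)&d, sizeof(cufftComplex) * n * batch);
    cudaMemset(d, 0, sizeof(cufftComplex) * n * batch);

    int dims[1] = { n };
    cufftHandle plan;
    // For a row-major n x batch matrix, column j starts at offset j and
    // consecutive elements of a column are 'batch' apart, so istride = batch
    // and idist = 1 (the next column starts one element over).
    cufftPlanMany(&plan, 1, dims,
                  dims, batch, 1,     // inembed, istride, idist
                  dims, batch, 1,     // onembed, ostride, odist
                  CUFFT_C2C, batch);

    cufftExecC2C(plan, d, d, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d);
    return 0;
}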

My blind guess is that, although the TK1 has a more modern core, the memory bandwidth dedicated to the 144 cores of your 635M is significantly higher than what is available to the Tegra.

Furthermore, CUDA is always a bit picky about warp/thread/grid sizes, so it's perfectly possible that the cuFFT algorithms were optimized for the local storage sizes of the Fermi generation and don't work as well on the Keplers.

Marcus Müller
  • Is there anything I can do to improve GFLOPS/s performance for the FFT? I'm using the library and I'm not sure if anything can be done to improve GFLOPS/s. – Raj Feb 25 '15 at 19:00
    As I pointed out in my earlier comment, FFT performance is typically limited by memory bandwidth, not floating-point throughput. There is nothing you can do to improve memory bandwidth, that is limited by hardware. NVIDIA [claims](http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/) improved performance for certain FFT configurations in CUDA 7.0, so you may want to give the CUDA 7.0 release candidate a try. – njuffa Feb 25 '15 at 19:17