How to determine CUDA GPU performance?

Question

I am writing a CUDA program that matches each input image (resolution ~180x180) against about 10,000 template images (resolution ~128x128). The goal is real-time performance, i.e. template matching of 25 to 30 input images (each against all 10,000 templates) in 1 second.

Currently I am using the following approach:

Preloaded all templates into GPU global memory to avoid I/O at runtime.
Created a single kernel that matches one source image against all the template images and returns an array of all positive matches.
Doing all operations in the time domain (not using FFT). I tried a radix-4 FFT implementation, but it requires many intermediate global-memory reads and writes, which ended up taking more time.

So far, matching 1 input image against 10,000 templates takes around 2 seconds.

My questions are:

Is there a way to determine whether this task is achievable in real time or not? I mean using the maximum FLOPS, I/O bandwidth limits, etc.
How can I tell whether the GPU is being fully utilized?
What are possible ways to improve the performance?

Machine specs: [i7-4770, 8GB, GTX-680]

Explanation of current kernel code:

All the template images (about 128x128 in RGB) are preloaded into GPU memory. The idea is to avoid I/O during runtime operation.
Every input image is loaded into texture memory, because textures are a good fit for 2D addressing.
Every block has 1024 threads.
Each thread computes the value of one output pixel; the output size is 31x31 = 961 pixels.
The number of blocks launched equals the number of template images being matched.

Kernel Code:

__global__ void cudaMatchTemplate(TemplateArray *templates, uchar *Match)
{
    int global = blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ int idx[TEMPLATE_MATCH_DIM];
    __shared__ float out_shared[TEMPLATE_MATCH_DIM];

    //halving the template size....
    int rows = (templates[blockIdx.x].nHeight)/2;
    int cols = (templates[blockIdx.x].nWidth)/2;

    int fullCol = templates[blockIdx.x].nWidth;

    int x = templates[blockIdx.x].nMatchLeft;
    int y = templates[blockIdx.x].nMatchTop;

    int offset_y =  (threadIdx.x/TEMPLATE_MATCH_SIZE);
    int offset_x =  (threadIdx.x - offset_y*TEMPLATE_MATCH_SIZE);

    // *************** Performing match in time domain *****************************//
    int sum = 0;
    float temp;
    int idxXFactor = 3*(2*(offset_x) + x);
    int idxYFactor = 2*(offset_y) + y ;
    
    for (int i = 0; i < rows; i++)
    {
        int I=3*i*fullCol;
        int sourceIdxY = idxYFactor + 2*i;
        for (int j = 0; j < cols; j++)
        {
            int J=3*j;
            int sourceIdxX = idxXFactor + 2*J;          
            int templateIdx = 2*I+2*J;
            //**** R *****//
            temp = float(tex2D(SourceImgColorTex,sourceIdxX,sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx]);
            sum = sum + temp*temp;
            //**** G *****//
            temp = float(tex2D(SourceImgColorTex,sourceIdxX+1,sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx +1]);
            sum = sum + temp*temp;
            //**** B *****//
            temp = float(tex2D(SourceImgColorTex,sourceIdxX+2,sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx +2]);
            sum = sum + temp*temp;
        }
    }

    __syncthreads();
    
//placing all values in shared memory for comparison.
    if(threadIdx.x < TEMPLATE_MATCH_DIM)
    {
        idx[threadIdx.x] = threadIdx.x;
        out_shared[threadIdx.x] = sum;
    }
    __syncthreads();


// //computing the Min location.....//

#pragma unroll
    for(int s=512; s>0; s>>=1) 
    {
        if ((threadIdx.x < s) &&((threadIdx.x + s)<TEMPLATE_MATCH_DIM))
        {
            idx[threadIdx.x] = out_shared[threadIdx.x] < out_shared[threadIdx.x + s] ? idx[threadIdx.x] : idx[threadIdx.x + s];
            out_shared[threadIdx.x]  = out_shared[threadIdx.x] < out_shared[threadIdx.x + s] ? out_shared[threadIdx.x] : out_shared[threadIdx.x + s];           
        }
        // every thread must reach this barrier so that the next step
        // reads the values written in this step (the block spans many warps)
        __syncthreads();
    }

    __syncthreads();

    if(threadIdx.x <1)
    {
        int half_Margin = MARGIN_FOR_TEMPLATE_MATCH/2;
        int matchY = idx[0]/TEMPLATE_MATCH_SIZE ;
        int matchX = idx[0] - matchY * TEMPLATE_MATCH_SIZE;

        int diff = absolute(half_Margin - matchX) + absolute(half_Margin - matchY);
        if(diff < THRESHOLD)
        {
            Match[blockIdx.x] = 1;
        }
        else
            Match[blockIdx.x] = 0;

    }
}

Are you really asking how to speed up some code you haven't shown and barely even described by 50-60 times? — talonmies, Jan 11 '14 at 16:44
I have updated my question as you requested. Please let me know if you need further clarification. I hope this will be helpful in answering the questions. — Genutek, Jan 11 '14 at 19:17

score 1 · Accepted Answer · edited Aug 09 '20 at 17:35

I'll try to answer most of your questions.

Is there a way to determine whether this task is achievable in real time or not? I mean using the maximum FLOPS, I/O bandwidth limits, etc.

I don't know of a way to determine up front whether the kernel can run in real time, but you can tune your kernel's launch configuration with the CUDA Occupancy Calculator. You may also consider using texture memory, surface memory, constant memory, pinned host memory, and more. Which of these help depends on how your algorithm is implemented.
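That said, a back-of-the-envelope bound is possible: count the arithmetic the posted kernel performs per input image and compare it with the card's peak throughput. The sketch below is plain host-side C++ and rests on assumptions that are not stated in the question: 3 floating-point operations per channel sample (subtract, multiply, add), one byte per template channel, a peak of about 3.09 TFLOPS (1536 cores x ~1006 MHz x 2) and about 192.2 GB/s of memory bandwidth for a GTX 680. All names in it are hypothetical.

```cpp
#include <cstdint>

// Work done per input image by a brute-force match like the posted kernel.
struct MatchWorkload {
    std::int64_t templates;    // number of templates (10,000 in the question)
    std::int64_t positions;    // output positions per template (31 x 31 = 961)
    std::int64_t sampledRows;  // template rows visited per position (128 / 2 = 64)
    std::int64_t sampledCols;  // template columns visited per position (128 / 2 = 64)
    std::int64_t channels;     // R, G, B
};

// Assumption: one subtract, one multiply and one add per channel sample.
constexpr std::int64_t kFlopsPerSample = 3;

// Assumed GTX 680 peak figures; real kernels reach only a fraction of these.
constexpr double kAssumedPeakFlops = 3.09e12;       // single precision, FLOP/s
constexpr double kAssumedPeakBandwidth = 192.2e9;   // bytes/s

// Total floating-point operations needed for one input image.
std::int64_t flopsPerFrame(const MatchWorkload& w) {
    return w.templates * w.positions * w.sampledRows * w.sampledCols *
           w.channels * kFlopsPerSample;
}

// Lower bound on the time per input image if the GPU ran at peakFlops throughout.
double computeBoundSeconds(const MatchWorkload& w, double peakFlops) {
    return static_cast<double>(flopsPerFrame(w)) / peakFlops;
}

// Bytes of template data, assuming one byte per channel and full-size templates.
std::int64_t templateBytes(std::int64_t templates, std::int64_t width,
                           std::int64_t height, std::int64_t channels) {
    return templates * width * height * channels;
}

// Lower bound on the time to stream `bytes` from global memory exactly once.
double memoryBoundSeconds(std::int64_t bytes, double peakBandwidth) {
    return static_cast<double>(bytes) / peakBandwidth;
}
```

Under these assumptions, `computeBoundSeconds({10000, 961, 64, 64, 3}, kAssumedPeakFlops)` is a little over 0.11 s per input image, while streaming all templates once takes under 3 ms. If the assumptions hold, the arithmetic alone already exceeds the 0.04 s target, so the kernel would need to do less work per frame, not just be tuned better.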

How can I tell whether the GPU is being fully utilized?

You can use the CUDA Occupancy Calculator and the CUDA Visual Profiler. I'd highly recommend the Visual Profiler; it will guide you through understanding how your kernel actually behaves on the GPU.
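To give a feel for what the Occupancy Calculator works out, here is a simplified host-side C++ sketch of the same bookkeeping. It is not the tool's actual algorithm: it ignores register and shared-memory allocation granularity, and every name in it is hypothetical. The per-SM limits passed in the usage note are the compute capability 3.0 values (2048 resident threads, 16 resident blocks, 65536 registers, 48 KB shared memory). The registers-per-thread figure is a placeholder, since the real value has to come from compiling the kernel with `nvcc --ptxas-options=-v`.

```cpp
#include <algorithm>

// Per-multiprocessor resource limits (assumed inputs, taken from the
// compute capability table for the target GPU).
struct SmLimits {
    int maxThreads;   // maximum resident threads per SM
    int maxBlocks;    // maximum resident blocks per SM
    int registers;    // 32-bit registers per SM
    int sharedBytes;  // shared memory per SM in bytes
};

// Resources one block of the kernel consumes.
struct KernelUsage {
    int threadsPerBlock;
    int regsPerThread;        // hypothetical until reported by the compiler
    int sharedBytesPerBlock;
};

// Number of blocks that can be resident on one SM at the same time:
// the tightest of the thread, register, shared-memory and block limits.
int residentBlocksPerSm(const SmLimits& sm, const KernelUsage& k) {
    int byThreads = sm.maxThreads / k.threadsPerBlock;
    int byRegs = sm.registers / (k.regsPerThread * k.threadsPerBlock);
    int byShared = k.sharedBytesPerBlock > 0
                       ? sm.sharedBytes / k.sharedBytesPerBlock
                       : sm.maxBlocks;
    return std::min({sm.maxBlocks, byThreads, byRegs, byShared});
}

// Occupancy: resident threads divided by the maximum resident threads.
double occupancy(const SmLimits& sm, const KernelUsage& k) {
    int blocks = residentBlocksPerSm(sm, k);
    return static_cast<double>(blocks * k.threadsPerBlock) / sm.maxThreads;
}
```

For 1024-thread blocks with the two shared arrays (assuming `TEMPLATE_MATCH_DIM` is 961, that is 961 x 8 = 7688 bytes), `occupancy({2048, 16, 65536, 49152}, {1024, 32, 7688})` gives full occupancy, but raising the register count to 40 per thread drops it to one resident block per SM, i.e. 50%. High occupancy alone does not guarantee good performance, which is why the profiler remains the better guide.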

Possible ways to improve the performance?

There are several interesting ways to do so. First, you can tune your kernel launch using the tools above. If that's not enough, try implementing a pipeline with stream objects so that data copies and computation run at the same time.

If that doesn't work out, try hiding latency by having multiple host threads submit work to the GPU at the same time. Since CC 3.5, CUDA supports Hyper-Q, which may help you complete several calls in parallel. Note that the GTX 680 is a CC 3.0 device, so this would require a newer card.

If that's not going to work, consider using multiple GPU devices.

I have made some modifications to the code and reduced the time from 2 seconds to around 0.4 seconds, but there is still a long way to go to reach 0.04 seconds. Based on your answers I ran a few tests and got some idea of where further improvements can be made. Do you have any idea how much improvement I will get from coalesced memory access? — Genutek, Jan 14 '14 at 06:48
Coalesced memory access can significantly improve CUDA kernel performance. Try using texture/surface memory for read-only cached data. The improvement depends on the algorithm, but it may reduce runtime by around 10%. — TripleS, Jan 20 '14 at 06:48
