2

In the process of speeding up an application, I have a very simple kernel which does the type casting as shown below:

__global__ void UChar2FloatKernel(float *out, unsigned char *in, int nElem){
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i<nElem)
        out[i] = (float) in[i];
}

The global memory access is coalesced, and in my understanding using shared memory will not be beneficial either, as there are no repeated reads of the same memory. Does anyone have an idea whether there is any optimization which can be performed to speed up this kernel? The input and output data are already on the device, so no host to device memory copy is required.

Imran
  • You're right, the use of shared memory won't bring you any benefit, because you still have to load the elements from `in` once from global memory and write them back once to `out`. Maybe you can get an advantage if a single thread computes several elements, but you have to try it. If you haven't already done so, you can read the ["Best Practices Guide"](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/). Maybe there are some new hints in it for you. – hubs Jan 24 '14 at 12:08
  • Adding to hubs' comment, try using vector data types like float4 and uchar4, and let us know whether it makes a difference in performance. – Meluha Jan 24 '14 at 13:29
  • You cannot improve it further. A kernel for doing such a simple thing is a waste. If the data you are converting will be used as input for another kernel, then perform the conversion in THAT kernel (see the sketch after these comments). That will add a small extra computation to that kernel but will be hidden by the I/O gain of reading chars instead of floats. – DarkZeros Jan 24 '14 at 14:43
  • Thanks, I will have a look at the guide to see if I missed something. @DarkZeros I am doing this extra step to keep the ported version similar to the original version, but as you said it is a waste, and the efficient solution will be to do the conversion at the end of production or just before consumption. – Imran Jan 24 '14 at 18:49
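
A minimal sketch of the fusion DarkZeros describes, assuming a hypothetical downstream kernel (here called `ScaleKernel`, with an arbitrary `scale` parameter standing in for whatever work the consumer actually does): the consumer reads the `unsigned char` input directly and casts on the fly, so the intermediate float array and the separate conversion kernel disappear.

__global__ void ScaleKernel(float *out, const unsigned char *in, float scale, int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i < nElem)
        out[i] = scale * (float) in[i]; // cast fused into the consuming kernel
}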

4 Answers

12

The single biggest optimisation you can perform on a code like that one is to use resident threads and increase the number of transactions each thread performs. While the CUDA block scheduling model is pretty lightweight, it isn't free, and launching a lot of blocks containing threads which do only a single memory load and a single memory store will accrue a lot of block scheduling overhead. So only launch as many blocks as will "fill" all the SMs of your GPU, and have each thread do more work.

The second obvious optimization is to switch to 128 byte memory transactions for the loads, which should give you a tangible bandwidth utilization gain. On a Fermi or Kepler GPU this won't give as large a performance boost as on first and second generation hardware.

Putting this all together in a simple benchmark:

__global__ 
void UChar2FloatKernel(float *out, unsigned char *in, int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i<nElem)
        out[i] = (float) in[i];
}

__global__
void UChar2FloatKernel2(float *out,
                        const unsigned char *in,
                        int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    // grid-stride loop: each thread converts multiple elements
    for(; i<nElem; i+=gridDim.x*blockDim.x) {
        out[i] = (float) in[i];
    }
}

__global__
void UChar2FloatKernel3(float4 *out,
                        const uchar4 *in,
                        int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    // grid-stride loop over uchar4/float4 groups
    for(; i<nElem; i+=gridDim.x*blockDim.x) {
        uchar4 ival = in[i]; // 32 bit load
        float4 oval = make_float4(ival.x, ival.y, ival.z, ival.w);
        out[i] = oval; // 128 bit store
    }
}

int main(void)
{

    const int n = 2 << 20;
    unsigned char *a = new unsigned char[n];

    for(int i=0; i<n; i++) {
        a[i] = i%255;
    }

    unsigned char *a_;
    cudaMalloc((void **)&a_, sizeof(unsigned char) * size_t(n));
    float *b_;
    cudaMalloc((void **)&b_, sizeof(float) * size_t(n));
    cudaMemset(b_, 0, sizeof(float) * size_t(n)); // warmup
    // note: the host array is never copied to the device; only kernel timing matters here

    for(int i=0; i<5; i++)
    {
        dim3 blocksize(512);
        dim3 gridsize(n/512); // one thread per element: 4096 blocks
        UChar2FloatKernel<<<gridsize, blocksize>>>(b_, a_, n);
    }

    for(int i=0; i<5; i++)
    {
        dim3 blocksize(512);
        dim3 gridsize(8); // 4 blocks per SM
        UChar2FloatKernel2<<<gridsize, blocksize>>>(b_, a_, n);
    }

    for(int i=0; i<5; i++)
    {
        dim3 blocksize(512);
        dim3 gridsize(8); // 4 blocks per SM
        UChar2FloatKernel3<<<gridsize, blocksize>>>((float4*)b_, (uchar4*)a_, n/4);
    }
    cudaDeviceReset();
    return 0;
}  

gives me this on a small Fermi device:

>nvcc -m32 -Xptxas="-v" -arch=sm_21 cast.cu
cast.cu
tmpxft_000014c4_00000000-5_cast.cudafe1.gpu
tmpxft_000014c4_00000000-10_cast.cudafe2.gpu
cast.cu
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z18UChar2FloatKernel2PfPKhi' for 'sm_21'
ptxas : info : Function properties for _Z18UChar2FloatKernel2PfPKhi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 5 registers, 44 bytes cmem[0]
ptxas : info : Compiling entry function '_Z18UChar2FloatKernel3P6float4PK6uchar4i' for 'sm_21'
ptxas : info : Function properties for _Z18UChar2FloatKernel3P6float4PK6uchar4i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 8 registers, 44 bytes cmem[0]
ptxas : info : Compiling entry function '_Z17UChar2FloatKernelPfPhi' for 'sm_21'
ptxas : info : Function properties for _Z17UChar2FloatKernelPfPhi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 3 registers, 44 bytes cmem[0]
tmpxft_000014c4_00000000-5_cast.cudafe1.cpp
tmpxft_000014c4_00000000-15_cast.ii

>nvprof a.exe
======== NVPROF is profiling a.exe...
======== Command: a.exe
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   40.20    6.61ms       5    1.32ms    1.32ms    1.32ms  UChar2FloatKernel(float*, unsigned char*, int)
   29.43    4.84ms       5  968.32us  966.53us  969.46us  UChar2FloatKernel2(float*, unsigned char const *, int)
   26.35    4.33ms       5  867.00us  866.26us  868.10us  UChar2FloatKernel3(float4*, uchar4 const *, int)
    4.02  661.34us       1  661.34us  661.34us  661.34us  [CUDA memset]

In the latter two kernels, using only 8 blocks gives a large speed-up compared to 4096 blocks, which confirms the idea that multiple work items per thread is the best way to improve performance in this sort of memory bound, low instruction count kernel.

talonmies
  • Do you have any reference for that thread scheduling overhead that you mention above? What I don't quite understand is how single memory reads/writes can affect the thread scheduling at all. And also, thread scheduling is only done once, isn't it? – Greg K. Jan 22 '16 at 02:12
  • @GregKasapidis: I don't see where I mentioned thread scheduling overhead in that answer. – talonmies Jan 22 '16 at 06:07
  • "While the CUDA block scheduling model is pretty lightweight, it isn't free, and launching a lot of threads which do only a single memory load and single memory store will accrue a lot of scheduling overhead." – Greg K. Jan 22 '16 at 12:39
  • I recommend reading all the words in that sentence again. Nowhere did I mention *thread* scheduling. I mentioned *block* scheduling. They are different things. – talonmies Jan 22 '16 at 12:45
  • The profiling result (with Nsight Systems) on a 3080 Ti (Ampere, CC 8.6) looks very different. I increased the problem size to `2 << 22` and set the grid size to `4 * numSms`, where numSms is obtained from the runtime API (see the sketch below this comment). Now `UChar2FloatKernel` takes 34.7% of the total time, whereas `UChar2FloatKernel2`, which is supposed to be much faster, takes as much as 35.2%. Does this indicate a change in hardware where the block scheduling overhead has become completely negligible? @talonmies @Robert Crovella ? – biubiuty Jul 24 '23 at 04:22
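
For reference, a sketch of obtaining the SM count from the runtime API and sizing the grid as 4 * numSms, as in the comment above; it assumes the benchmark from this answer, reusing its `b_`, `a_` and `n`:

    // query the multiprocessor count of the current device
    int device = 0, numSms = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);

    dim3 blocksize(512);
    dim3 gridsize(4 * numSms); // a few resident blocks per SM
    UChar2FloatKernel2<<<gridsize, blocksize>>>(b_, a_, n);
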
2

Here is a CPU version of the function and 4 GPU kernels. Three of the kernels are from @talonmies' answer, and I have added kernel2, which uses only the vector data types (without the grid-stride loop).

// cpu version for comparison
void UChar2Float(unsigned char *a, float *b, const int n){
    for(int i=0;i<n;i++)
        b[i] = (float)a[i];
}

__global__ void UChar2FloatKernel1(float *out, const unsigned char *in, int nElem){
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i<nElem)     out[i] = (float) in[i];
}

__global__ void UChar2FloatKernel2(float4  *out, const uchar4 *in, int nElem){
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i<nElem) {
        uchar4 ival = in[i]; // 32 bit load
        float4 oval = make_float4(ival.x, ival.y, ival.z, ival.w);
        out[i] = oval; // 128 bit store
    }
}

__global__ void UChar2FloatKernel3(float  *out, const unsigned char *in, int nElem) {
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    for(; i<nElem; i+=gridDim.x*blockDim.x) 
    {
        out[i] = (float) in[i];
    }
}

__global__ void UChar2FloatKernel4(float4  *out, const uchar4 *in, int nElem) {
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    for(; i<nElem; i+=gridDim.x*blockDim.x) 
    {
        uchar4 ival = in[i]; // 32 bit load
        float4 oval = make_float4(ival.x, ival.y, ival.z, ival.w);
        out[i] = oval; // 128 bit store
    }
}

On my GeForce GT 640, here are the timing results:

simpleKernel (cpu):         0.101463 seconds.
simpleKernel 1 (gpu):       0.007845 seconds.
simpleKernel 2 (gpu):       0.004914 seconds.
simpleKernel 3 (gpu):       0.005461 seconds.
simpleKernel 4 (gpu):       0.005461 seconds.

So we can see that kernel2, which utilizes vector types only, is the winner. I ran these tests for (32 * 1024 * 768) elements. The nvprof output is also shown below:

Time(%)      Time     Calls       Avg       Min       Max  Name
91.68%  442.45ms         4  110.61ms  107.43ms  119.51ms  [CUDA memcpy DtoH]
3.76%  18.125ms         1  18.125ms  18.125ms  18.125ms  [CUDA memcpy HtoD]
1.43%  6.8959ms         1  6.8959ms  6.8959ms  6.8959ms  UChar2FloatKernel1(float*, unsigned char const *, int)
1.10%  5.3315ms         1  5.3315ms  5.3315ms  5.3315ms  UChar2FloatKernel3(float*, unsigned char const *, int)
1.04%  5.0184ms         1  5.0184ms  5.0184ms  5.0184ms  UChar2FloatKernel4(float4*, uchar4 const *, int)
0.99%  4.7816ms         1  4.7816ms  4.7816ms  4.7816ms  UChar2FloatKernel2(float4*, uchar4 const *, int)
Imran
1

You can decorate the input array pointer with the `const __restrict__` qualifiers, which notify the compiler that the data is read-only and not aliased through any other pointer. The compiler can then optimise the loads by routing them through one of the read-only caches (the constant cache or, on compute capability >= 3.5, the read-only data cache, also known as the texture cache).

You can also decorate the output array pointer with the `__restrict__` qualifier to enable further compiler optimizations.
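
A sketch of what the decorated kernel signature would look like; the body is unchanged from the question's kernel, and whether the compiler actually routes the loads through the read-only cache depends on the architecture and compiler version:

__global__ void UChar2FloatKernel(float * __restrict__ out,
                                  const unsigned char * __restrict__ in,
                                  int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i < nElem)
        out[i] = (float) in[i];
}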

Finally, the recommendation by DarkZeros is worth following.

Vitality
0

You would be better off writing a vectorized version of your code, writing a float4 into `out` at once. This should be pretty straightforward if nElem happens to be a multiple of 4; otherwise, you need to handle the remainder, as in the sketch below.
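
A minimal sketch of that idea (an illustration, not a tuned implementation): each thread converts one uchar4/float4 group, and the few leftover elements when nElem is not a multiple of 4 are handled element-wise by the first threads. The reinterpret_casts assume the pointers come from cudaMalloc, which provides sufficient alignment.

__global__ void UChar2FloatVec4(float *out, const unsigned char *in, int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    int nVec = nElem / 4;                                    // whole uchar4 groups

    if(i < nVec) {
        uchar4 v = reinterpret_cast<const uchar4 *>(in)[i];  // 32 bit load
        reinterpret_cast<float4 *>(out)[i] = make_float4(v.x, v.y, v.z, v.w); // 128 bit store
    }

    int rem = nElem - 4 * nVec;                              // 0..3 leftover elements
    if(i < rem)
        out[4 * nVec + i] = (float) in[4 * nVec + i];
}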

sramij