
I'm writing code that performs a 2D convolution on a float matrix, in both tiled and untiled versions. I'm taking the width of the tile as

BLOCK_WIDTH - MASK_WIDTH + 1

using halo cells.
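
For example, with BLOCK_WIDTH = 32 and mask_width = 5, each 32×32 thread block loads a 32×32 input patch into shared memory but computes only a 32 - 5 + 1 = 28×28 output tile; the remaining 2-cell border on each side (mask_width / 2 = 2) holds the halo cells.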

But for a 1024×1024 matrix and mask widths varying from 3 to 9, the untiled version performs better:

(timing screenshots: untiled version vs. tiled version)

Both the matrix and the mask are initialized with fixed values, identical for the tiled and untiled versions; no random values or sizes are used. I guess I'm making some wrong assumption about the tile size, but even after doing some research the implementation seems quite legit.

 #define MATRIX_SIZE 1024
 #define BLOCK_WIDTH 32

Here's the kernel code for the tiled version:

__global__ void convolution_2D_tiled(float* in, const float* __restrict__ mask, float* out, size_t mask_width, size_t w, size_t h) {
    float outputPixel = 0; // minimize writes to global memory: accumulate in a register

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int tile_width = BLOCK_WIDTH - mask_width + 1; // since BLOCK_WIDTH = TILE_WIDTH + MASK_WIDTH - 1

    int col = blockIdx.x * tile_width + tx;
    int row = blockIdx.y * tile_width + ty;
    // picking the starting indexes of the input matrix under the mask
    // (TOP-LEFT of the mask)
    int inputRow = row - (mask_width / 2);
    int inputCol = col - (mask_width / 2);

    __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];

    // Load tile elements, zero-padding outside the input
    if (inputRow >= 0 && inputRow < h && inputCol >= 0 && inputCol < w)
        tile[ty][tx] = in[inputRow * w + inputCol];
    else
        tile[ty][tx] = 0.0f;

    // Wait until all tile elements are loaded
    __syncthreads();

    // some threads won't write any output: each block only computes tile_width x tile_width elements
    if (col < w && row < h && ty < tile_width && tx < tile_width) {
        // accumulate over the mask neighborhood
        for (int i = 0; i < mask_width; ++i) {
            for (int j = 0; j < mask_width; ++j) { // mask_width^2 shared-memory reads per computing thread -> mask_width^2 * tile_width^2 per block
                outputPixel += tile[i + ty][j + tx] * mask[i * mask_width + j];
            }
        }
        out[(row * w) + col] = outputPixel;
    }
}
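
To put numbers on the comment in the inner loop: with mask_width = 5, tile_width = 28, so each block performs 5² × 28² = 19,600 shared-memory reads but only 28² = 784 global-memory writes.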

Here's main, with the matrix generation and size assumptions:

void errorCheck(unsigned int line){
    cudaError_t cudaError = cudaGetLastError();

    // if the error code wasn't a success code, report the line
    // of the CUDA function call and exit the program
    if (cudaError != cudaSuccess)
    {
        printf("CUDA error in line %u in file %s: %s\n", line - 1, __FILE__, cudaGetErrorString(cudaError));
        exit(EXIT_FAILURE);
    }
}



int main(int argc, char const* argv[]){


for (size_t mask_width = 3; mask_width <= 9; mask_width += 2) {
    printf("Testing with mask size = %d\n\n", mask_width);
    float* a;
    float* b;
    float* c;
    cudaMallocManaged((void **) &a, sizeof(float)*MATRIX_SIZE*MATRIX_SIZE);
    cudaMallocManaged((void **) &b, sizeof(int)*mask_width*mask_width);
    cudaMallocManaged((void **) &c, sizeof(int)*MATRIX_SIZE*MATRIX_SIZE);


    // initialize matrix A
    for (int i = 0; i < MATRIX_SIZE; ++i) {
        for (int j = 0; j < MATRIX_SIZE; ++j) {
          a[i * MATRIX_SIZE + j] = (float)(1 +(3 * j % 20));
        }
    }

    // initialize matrix B
    for (int i = 0; i < mask_width; ++i) {
        for (int j = 0; j < mask_width; ++j) {
            b[i * mask_width + j] = (float)(1 + (((2 * i) + j) % mask_width));
        }
    }

    float  naive_gpu_elapsed_time_ms;

    // some events to count the execution time
    //clock_t st, end;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int tile_width = BLOCK_WIDTH - mask_width + 1;
    // round up so the grid covers the whole matrix even when tile_width doesn't divide MATRIX_SIZE
    dim3 dimGrid((MATRIX_SIZE + tile_width - 1) / tile_width, (MATRIX_SIZE + tile_width - 1) / tile_width);
    dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
    errorCheck(__LINE__);


    cudaEventRecord(start, 0);
    convolution_2D_tiled <<<dimGrid, dimBlock >>> (a, b, c, mask_width, MATRIX_SIZE, MATRIX_SIZE);
    errorCheck(__LINE__);
    cudaDeviceSynchronize(); // cudaThreadSynchronize() is deprecated

    //time counting terminate
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);



    //compute time elapsed on GPU computing
    cudaEventElapsedTime(&naive_gpu_elapsed_time_ms, start, stop);
    printf("Time elapsed on naive GPU convolution 2d tiled ( %d ) block %f ms.\n\n", BLOCK_WIDTH, naive_gpu_elapsed_time_ms);

    //free memory
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    printf("________________________________________________________________________\n\n");
}
return 0;

}

I'm using Google Colab with a Tesla T4 GPU, and no CUDA error is thrown. I also tried bigger masks (11, 15, ...), but the comparison between the tiled and untiled versions didn't change.

  • What's your block width? – Homer512 Oct 29 '22 at 09:52
  • 32, the specs of T4 says max thread per block is 1024 @Homer512 – ctrlaltdel Oct 29 '22 at 09:58
  • 1024 threads per block is too much. This will hurt utilization because on every `__syncthreads()`, too many threads at once have to wait for stragglers. Pipelining suffers, which eats up all the benefits you get compared to simply letting the L2 cache handle the tiling. Try 256 threads per block – Homer512 Oct 29 '22 at 10:00
  • I just tried with a 16 BLOCK_WIDTH but timing is increased compared to 32 width @Homer512 – ctrlaltdel Oct 29 '22 at 10:06
  • Have you compared performance with nvidia's implementation in NPP? https://docs.nvidia.com/cuda/npp/group__image__convolution.html Also, any reason why you don't cache the mask? – Homer512 Oct 29 '22 at 12:18
  • There is also some research that suggests shared memory is no longer the be-all-end-all for convolution. https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_action_common_download&item_id=186051&item_no=1&attribute_id=1&file_no=1 I believe I once had a better paper on the topic that just relied on caches, unrolling and explicit prefetching but I can't find it right now. – Homer512 Oct 29 '22 at 12:34

1 Answer


You are making inefficient use of managed memory, as discussed here and here.

Nearly all of your ~2ms of execution time is used in inefficient demand-paged copying of data from host to device. As a result, your ability to resolve the difference in performance in the two cases due to the device code changes is almost completely obscured.

If you add these 3 lines of code immediately before float naive_gpu_elapsed_time_ms;, you will observe that your reported execution times decrease dramatically, and you should be able to better judge the performance difference between the shared memory tiled version and the non-tiled version:

cudaMemPrefetchAsync(a, sizeof(float)*MATRIX_SIZE*MATRIX_SIZE, 0);
cudaMemPrefetchAsync(b, sizeof(float)*mask_width*mask_width, 0);
cudaMemPrefetchAsync(c, sizeof(float)*MATRIX_SIZE*MATRIX_SIZE, 0);
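
(The third argument of cudaMemPrefetchAsync() is the destination device ordinal; 0 here is the GPU the kernel runs on. With the data already resident on the device, the kernel no longer pays for demand-paged migrations inside the timed region.)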

You haven't shown your non-tiled code, so I can't demonstrate that for you.
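
A straightforward non-tiled kernel just has each thread read its neighborhood directly from global memory, relying on the caches. As a rough sketch (matching the tiled kernel's signature; the exact kernel I used for the profile below may differ):

__global__ void convolution_2D(float* in, const float* __restrict__ mask, float* out, size_t mask_width, size_t w, size_t h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < w && row < h) {
        float outputPixel = 0;
        for (int i = 0; i < mask_width; ++i) {
            for (int j = 0; j < mask_width; ++j) {
                // input coordinates under mask element (i, j), zero-padded outside the image
                int inputRow = row - (mask_width / 2) + i;
                int inputCol = col - (mask_width / 2) + j;
                if (inputRow >= 0 && inputRow < h && inputCol >= 0 && inputCol < w)
                    outputPixel += in[inputRow * w + inputCol] * mask[i * mask_width + j];
            }
        }
        out[row * w + col] = outputPixel;
    }
}

Here's an example profiling output using a non-tiled convolution code that I wrote, comparing to your tiled kernel, and including the cudaMemPrefetchAsync() statements: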

$ nvprof ./t2140
Testing with mask size = 3

==13236== NVPROF is profiling process 13236, command: ./t2140
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.032832 ms.

________________________________________________________________________

Testing with mask size = 5

Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.061120 ms.

________________________________________________________________________

Testing with mask size = 7

Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.086080 ms.

________________________________________________________________________

Testing with mask size = 9

Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.118688 ms.

________________________________________________________________________

==13236== Profiling application: ./t2140
==13236== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.59%  311.69us         4  77.922us  41.089us  119.08us  convolution_2D(float*, float const *, float*, unsigned long, unsigned long, unsigned long)
                   47.41%  280.97us         4  70.241us  28.449us  114.28us  convolution_2D_tiled(float*, float const *, float*, unsigned long, unsigned long, unsigned long)
      API calls:   96.10%  365.32ms        12  30.443ms  12.906us  365.10ms  cudaMallocManaged
                    1.32%  5.0301ms         4  1.2575ms  586.91us  3.2433ms  cuDeviceTotalMem
                    0.66%  2.4917ms       404  6.1670us     320ns  268.82us  cuDeviceGetAttribute
                    0.56%  2.1277ms        12  177.31us  8.3020us  578.90us  cudaMemPrefetchAsync
                    0.50%  1.9035ms         4  475.88us  295.08us  549.01us  cudaDeviceSynchronize
                    0.49%  1.8594ms        12  154.95us  75.533us  328.85us  cudaFree
                    0.14%  526.53us         4  131.63us  42.014us  220.14us  cudaEventSynchronize
                    0.11%  399.28us         4  99.820us  61.310us  210.74us  cuDeviceGetName
                    0.09%  351.52us         8  43.940us  11.426us  116.52us  cudaLaunchKernel
                    0.01%  45.911us         8  5.7380us  4.1870us  10.243us  cudaEventRecord
                    0.01%  25.946us         8  3.2430us     935ns  10.182us  cudaEventCreate
                    0.01%  21.643us         4  5.4100us  3.1450us  8.6700us  cuDeviceGetPCIBusId
                    0.00%  10.304us         8  1.2880us     430ns  5.0980us  cuDeviceGet
                    0.00%  9.6790us         4  2.4190us  1.9560us  3.7180us  cudaEventElapsedTime
                    0.00%  3.3390us         3  1.1130us     617ns  1.6520us  cuDeviceGetCount
                    0.00%  3.2480us         4     812ns     700ns  1.0470us  cuDeviceGetUuid
                    0.00%  3.1420us         8     392ns     229ns  1.2110us  cudaGetLastError

==13236== Unified Memory profiling result:
Device "Tesla V100-PCIE-32GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      12  1.3346MB  4.0000KB  2.0000MB  16.01563MB  1.405760ms  Host To Device
Total CPU Page faults: 52
$

You can see that in each case, the tiled/shared memory kernel is faster.

Robert Crovella