
I am following a tutorial to learn CUDA, and I learned that unrolling a kernel function can accelerate the program. It indeed worked when I wrote a kernel that sums an array.

But when I wrote a matrix transpose kernel following the tutorial, it didn't work. The original function is shown below:

__global__ void transform_matrix_read_col(
    int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
    int ix = threadIdx.x + blockDim.x * blockIdx.x;
    int iy = threadIdx.y + blockDim.y * blockIdx.y;
    int row_idx = iy*col_num + ix;
    int col_idx = ix*row_num + iy;
    if(ix < col_num && iy < row_num){
        mat_b[row_idx] = mat_a[col_idx];
    }
}

and the unrolled function:

__global__ void transform_matrix_read_col_unrool(
    int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
    int ix = threadIdx.x +(blockDim.x * blockIdx.x * 4);
    int iy = threadIdx.y + blockDim.y * blockIdx.y;
    int row_idx = iy*col_num + ix;
    int col_idx = ix*row_num + iy;
    if(ix < col_num && iy < row_num){
        mat_b[row_idx] = mat_a[col_idx];
        mat_b[row_idx + blockDim.x*1] = mat_a[col_idx + row_num*blockDim.x*1];
        mat_b[row_idx + blockDim.x*2] = mat_a[col_idx + row_num*blockDim.x*2];
        mat_b[row_idx + blockDim.x*3] = mat_a[col_idx + row_num*blockDim.x*3];
    }
}

and the main function:

size_t width = 128 , height = 128,
    array_size = width*height,array_bytes = array_size * sizeof(int);
    int* matrix_data = nullptr,*output_data = nullptr;
    cudaMallocHost(&matrix_data, array_bytes);
    cudaMallocHost(&output_data, array_bytes);
    util::init_array_int(matrix_data,array_size);//this func will random generate some integer

    int* matrix_data_dev = nullptr,* output_matrix_dev = nullptr;
    cudaMalloc(&matrix_data_dev, array_bytes);
    cudaMemcpy(matrix_data_dev, matrix_data, array_bytes, cudaMemcpyHostToDevice);
    cudaMalloc(&output_matrix_dev, array_bytes);
    
    dim3 block(32,16);
    dim3 grid((width-1)/block.x+1,(height-1)/block.y+1);
    dim3 gridUnrool4((width-1)/(block.x*4)+1,(height-1)/block.y +1);

    transform_matrix_read_col<<<grid,block>>>(matrix_data_dev, output_matrix_dev, height, width);
    cudaDeviceSynchronize();

    transform_matrix_read_col_unrool<<<gridUnrool4,block>>>(matrix_data_dev, output_matrix_dev, height, width);
    cudaDeviceSynchronize();

and the statistics from nsys (run on Linux with an RTX 3090):

CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances  Average   Minimum  Maximum  Name
 -------  ---------------  ---------  --------  -------  -------  ---------------------------------------------------------------------------
     6.3            3,456          1   3,456.0    3,456    3,456  transform_matrix_read_col_unrool(int*, int*, unsigned long, unsigned long)
     5.2            2,880          1   2,880.0    2,880    2,880  transform_matrix_read_col(int*, int*, unsigned long, unsigned long)
     

We can see that the unrolled version is a lot slower.
But the tutorial says that unrolling actually accelerates the transpose.

So what causes this problem? And how can I accelerate the matrix transpose?

anywayName
1 Answer


Unrolling only helps if the computation is compute-bound, so that a higher (useful) instruction throughput can decrease the execution time. Memory-bound code tends not to be much faster once unrolled because memory-bound instructions are slowed down by contention on the memory controller.

A transposition may not seem memory-bound at first glance because of a low apparent memory throughput, but one needs to care about cache lines. Indeed, when a single value is requested from memory by the user code, the hardware actually fetches a pretty big cache line so that (subsequent) contiguous accesses are fast.

Another consideration is that the code can also be latency-bound. Indeed, the inefficient strided accesses can be slow due to memory latency. The memory controller may not be able to fully saturate the RAM in this case (although this is quite unlikely on GPUs, especially given the large cache lines). If so, adding more instructions does not help because they are typically executed in order, as opposed to on modern mainstream CPUs. Using larger blocks and more blocks helps to provide more parallelism to the GPU, which can then perform more concurrent memory accesses and possibly make better use of the memory.

The key with the transposition is to make accesses as contiguous as possible and to reuse cache lines. The most critical thing is to operate on small 2D tiles and not on full rows/columns (i.e. not a 1D kernel) so as to increase cache locality. Moreover, one efficient, well-known solution is to use shared memory: each thread of a CUDA block fetches a part of a 2D tile, and the block can then perform the transposition in shared memory more efficiently. It is not so easy due to possible shared-memory bank conflicts that can impact performance. Fortunately, there have been quite a few research papers and articles about this over the last decades.
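
For reference, here is a minimal sketch of such a tiled, shared-memory transpose, adapted to the int matrices of the question (this is only an illustration, not your exact code: TILE is an assumed tile width, the block must be launched as TILE x TILE threads, and the layout assumptions are stated in the comments):

// Sketch of a tiled transpose: each block stages a TILE x TILE tile in shared
// memory so that both the global reads and the global writes are coalesced.
#define TILE 32

__global__ void transpose_tiled(const int* __restrict__ in, int* __restrict__ out,
                                size_t row_num, size_t col_num)
{
    // +1 padding avoids shared-memory bank conflicts on the transposed access.
    __shared__ int tile[TILE][TILE + 1];

    // Coalesced read of one tile of the row_num x col_num (row-major) input.
    size_t x = blockIdx.x * TILE + threadIdx.x;
    size_t y = blockIdx.y * TILE + threadIdx.y;
    if (x < col_num && y < row_num)
        tile[threadIdx.y][threadIdx.x] = in[y * col_num + x];

    __syncthreads();

    // Coalesced write of the transposed tile into the col_num x row_num output.
    x = blockIdx.y * TILE + threadIdx.x;  // block indices are swapped
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < row_num && y < col_num)
        out[y * row_num + x] = tile[threadIdx.x][threadIdx.y];
}

It would be launched with something like dim3 block(TILE, TILE); dim3 grid((col_num + TILE - 1) / TILE, (row_num + TILE - 1) / TILE);.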

The simplest efficient solution to this problem is basically to use cuBLAS, which is heavily optimized. This post may also be useful.
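
For example, a transposition can be expressed as a cublasSgeam call with alpha = 1 and beta = 0 (a sketch for a column-major float matrix; cuBLAS does not provide an integer transpose, so the data type differs from your int example):

#include <cublas_v2.h>

// d_in  : rows x cols, column-major, leading dimension = rows
// d_out : cols x rows, column-major, leading dimension = cols
void transpose_with_cublas(cublasHandle_t handle,
                           const float* d_in, float* d_out,
                           int rows, int cols)
{
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * op(A) + beta * op(B); with op(A) = A^T and beta = 0,
    // C (cols x rows) receives the transpose of A (rows x cols).
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                cols, rows,           // dimensions of the output matrix C
                &alpha, d_in, rows,   // A and its leading dimension
                &beta, d_out, cols,   // B is ignored since beta = 0
                d_out, cols);         // C and its leading dimension
}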

Note that a 128x128 transposition is very small for a GPU. GPUs are designed to compute much bigger datasets (or far more expensive computations on such a small input). If the input array is initially stored on the CPU, then I strongly advise you to do the transposition directly on the CPU, as moving the data to the GPU will likely already be slower than computing the transposition efficiently on the CPU. Indeed, data cannot be moved faster than the main RAM permits, and a 128x128 transposition can be implemented in a way that saturates the main RAM (in fact, it can likely be done directly in the CPU caches, which are significantly faster than the main RAM).
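
For instance, a simple cache-blocked CPU transpose for the 128x128 int case could look like the following sketch (BLK is an assumed block size):

#include <cstddef>

constexpr std::size_t BLK = 32;

// Cache-blocked CPU transpose: each BLK x BLK block of the row_num x col_num
// (row-major) input fits in the caches, so reads and writes stay cache-friendly.
void transpose_cpu(const int* in, int* out, std::size_t row_num, std::size_t col_num)
{
    for (std::size_t ib = 0; ib < row_num; ib += BLK)
        for (std::size_t jb = 0; jb < col_num; jb += BLK)
            for (std::size_t i = ib; i < ib + BLK && i < row_num; ++i)
                for (std::size_t j = jb; j < jb + BLK && j < col_num; ++j)
                    out[j * row_num + i] = in[i * col_num + j];
}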

Jérôme Richard
    The first sentence/paragraph is debatable/doubtful. One of the key benefits of unrolling is described [here](https://stackoverflow.com/questions/43832429/is-starting-1-thread-per-element-always-optimal-for-data-independent-problems-on/43833050#43833050); allowing reads (which have long latency) to be scheduled as early as possible. Because the GPU is not an out-of-order machine, if you have a dependency that prevents further instruction issue, then read latency may end up getting stacked end-to-end, rather than all being allowed to retire "in parallel", as early as possible. – Robert Crovella Oct 29 '22 at 15:46
    The GPU memory subsystem is certainly capable of having multiple reads in flight. The GPU compiler of course is aware of this and will seek to "work around" such dependencies, and unrolling is a key mechanism to enable that. An entirely memory-bound data copy kernel can benefit immensely from unrolling, unless you artificially prevent the compiler from unrolling such a copy loop. – Robert Crovella Oct 29 '22 at 15:47
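
To illustrate the point made in these comments, here is a sketch of a purely memory-bound copy kernel unrolled by a factor of 4 (the name and structure are illustrative): each thread issues four independent, coalesced loads, so their latencies can overlap instead of being stacked end-to-end.

__global__ void copy_unroll4(const int* __restrict__ in, int* __restrict__ out, size_t n)
{
    size_t i = blockIdx.x * blockDim.x * 4 + threadIdx.x;
    if (i + 3 * blockDim.x < n) {
        // The four loads are independent, so they can all be in flight before
        // any store needs their result, hiding most of the memory latency.
        int a = in[i];
        int b = in[i + blockDim.x];
        int c = in[i + 2 * blockDim.x];
        int d = in[i + 3 * blockDim.x];
        out[i]                  = a;
        out[i + blockDim.x]     = b;
        out[i + 2 * blockDim.x] = c;
        out[i + 3 * blockDim.x] = d;
    } else {
        // Tail: handle the remaining elements one by one.
        for (int k = 0; k < 4; ++k) {
            size_t j = i + (size_t)k * blockDim.x;
            if (j < n) out[j] = in[j];
        }
    }
}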