Matrix the Rectangle Part transpose Cuda

Question

I'm writing a CUDA program to transpose a square matrix. The idea is to do it in two parts, depending on the size of the matrix: the largest part whose side is a multiple of the tile size is transposed tile by tile, and the rectangular part that remains is transposed separately. For example, for a 67 x 67 matrix with a tile size of 32, the first part is the 64 x 64 block, and the second part is the remaining 3 x 67 strip.

My problem is with the rectangular part. The code below shows the main program with the defined values:
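For reference, the split described above can be sketched as two small helpers. The names evenPart and edgePart are hypothetical and only illustrate the arithmetic; they are not part of the program below.

```cuda
// Side of the largest square whose size is a multiple of the tile size.
__host__ __device__ inline int evenPart(int n, int tile)
{
    return (n / tile) * tile;
}

// Width of the leftover strip that does not fill a whole tile.
__host__ __device__ inline int edgePart(int n, int tile)
{
    return n % tile;
}
```

With n = 67 and tile = 32 this gives an even part of 64 and an edge of 3, which matches the example above.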

const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;

const int Nx = 2024; //size of the matrix
const int Ny = 2024;

int main(int argc, char **argv)
{
    const int nx = Nx;
    const int ny = Ny; // Size of the Arrays
    const int mem_size = nx*ny*sizeof(int); // Size of the Orig.Arr

    int *h_idata = (int*)malloc(mem_size); // original Host Arr.

    int *d_idata; // device Arr.
    checkCuda(cudaMalloc(&d_idata, mem_size));

    dim3 dimGridX(nx / TILE_DIM, 1, 1); // grid dimension used
    dim3 dimBlockX(TILE_DIM, 1, 1); // number of threads used

    // the Kernel Function for only the rectangle
    EdgeTransposeX<<<dimGrid, dimBlock>>>(d_idata);
    cudaEventRecord(startEvent, 0);
    cudaEventRecord(stopEvent, 0);
    cudaEventSynchronize(stopEvent);
    cudaEventElapsedTime(&ms, startEvent, stopEvent);
    cudaMemcpy(h_idata, d_idata, mem_size, cudaMemcpyDeviceToHost);

For the kernel, I was advised not to use shared memory, so this is how I have written it:

__global__ void EdgeTransposeX(int *idata)
{
    int tile_C[Edge][Nx];
    int tile_V[Nx][Edge];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;

    if (x == (nEven - 1))
    {
        for (int j = 0; j < Nx; j++)
            for (int i = 1; i <= Edge; i++)
            {
                tile_V[j][i - 1] = idata[j*Nx + (x + i)];
                tile_C[i - 1][j] = idata[(x + i)*Nx + j];
            }

        __syncthreads();

        for (int j = 0; j < Nx; j++)
            for (int i = 1; i <= Edge; i++)
            {
                idata[j*Nx + (x + i)] = tile_C[i - 1][j];
                idata[(x + i)*Nx + j] = tile_V[j][i - 1];
            }
    }
}

The code works fine until the matrix size reaches 1025; beyond that it stops working. Any idea why? Am I missing something here?

One of the things you seem to be missing is [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). — Robert Crovella, Mar 18 '15 at 01:22
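A minimal sketch of what such checking could look like. The CUDA_CHECK macro, dummyKernel and launchChecked are hypothetical names used for illustration only; the runtime calls (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: report any CUDA runtime error with file/line and abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void dummyKernel(int *data)
{
    data[threadIdx.x] = threadIdx.x;
}

// Launch the kernel and surface both launch-time and run-time errors.
inline void launchChecked(int *d_data, int threads)
{
    dummyKernel<<<1, threads>>>(d_data);
    CUDA_CHECK(cudaGetLastError());       // e.g. an invalid launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
}
```

A kernel launch returns no status itself, so without these two checks a failed launch simply leaves the output buffer unchanged, which can look like the code "stopped working".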

DanRo · Answer 1 · 2015-03-19T23:35:24.473

Your two-dimensional arrays tile_C and tile_V are physically stored in the GPU's local memory. The amount of local memory per thread is limited to 512 KB, so verify that you are not using more than 512 KB of local memory per thread.

From the "CUDA C Programming Guide" (2015), page 89: "An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory."

My suggestion is to use the Visual Profiler to check occupancy, register usage and local memory usage.
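One way to do that verification is to compute the size of the two arrays directly. This is only a sketch: the helper names are hypothetical, and it counts just tile_C[Edge][Nx] and tile_V[Nx][Edge] as declared in the question, ignoring anything else the compiler may place in local memory.

```cuda
#include <cstddef>

// Bytes needed per thread for tile_C[edge][nx] plus tile_V[nx][edge] (int elements).
__host__ __device__ inline size_t scratchBytesPerThread(size_t nx, size_t edge)
{
    return 2 * edge * nx * sizeof(int);
}

// The 512 KB per-thread figure quoted above.
const size_t LOCAL_MEM_LIMIT_BYTES = 512 * 1024;

// True if the two arrays alone stay within that figure.
inline bool fitsInLocalMemory(size_t nx, size_t edge)
{
    return scratchBytesPerThread(nx, edge) <= LOCAL_MEM_LIMIT_BYTES;
}
```

The returned value can be compared with the 512 KB figure for the Nx and Edge actually in use. Compiling with nvcc -Xptxas -v also prints the stack frame and spill sizes the compiler chose for each kernel, which gives a second way to check.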

This link may be helpful for you: link.

I implemented the transpose of a square matrix using 2D CUDA surfaces; it works fine for sizes from 2 to 16384 in power-of-two increments. If you don't mind implementing a non-tiled version, I recommend this approach.
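To illustrate what a non-tiled version can look like, here is a minimal sketch that uses plain global memory rather than surfaces, so it is not the surface-based implementation mentioned above. The kernel and wrapper names are hypothetical, and the 16 x 16 block size is an arbitrary choice.

```cuda
#include <cuda_runtime.h>

// Each thread that maps to an element strictly above the diagonal swaps it
// with its mirror element, so every pair is touched by exactly one thread.
__global__ void transposeInPlaceNaive(int *data, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < n && col < n && col > row) {
        int upper = data[row * n + col];
        data[row * n + col] = data[col * n + row];
        data[col * n + row] = upper;
    }
}

// Host wrapper: covers the n x n matrix with a 2D grid, rounding up so that
// sizes that are not a multiple of the block size are handled as well.
inline cudaError_t transposeInPlace(int *d_data, int n)
{
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    transposeInPlaceNaive<<<grid, block>>>(d_data, n);
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

Because the bounds check handles the ragged edge, no separate pass for the leftover rectangle is needed and no per-thread arrays are declared. The mirrored accesses are not coalesced, so this trades speed for simplicity.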

Thanks, I downloaded GPU viewer; under CUDA it shows Threads / block: 1024. If this is my issue, what do you recommend? Should I use shared memory? I have tried using it, but it gives me the same issue. — userTasim, Mar 18 '15 at 21:57
