I'm writing a CUDA program to transpose a square matrix. The idea is to do it in two parts, depending on the size of the matrix: the largest part whose size is a multiple of the tile size is transposed tile by tile, and the remaining rectangular part is transposed separately. For example, for a 67 x 67 matrix with a tile size of 32, the first part is the 64 x 64 block, and the second part is the remaining 3 x 67 strip.
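To make the split concrete, here is a minimal host-side sketch of the arithmetic. The names `Split` and `split_matrix` are hypothetical and exist only for this illustration; they are not part of my program.

```cpp
// Hypothetical helper, for illustration only: splits an n x n matrix into
// the part covered by whole tiles and the leftover edge.
struct Split {
    int n_even;  // largest multiple of tile that fits in n
    int edge;    // leftover rows/columns: n - n_even
};

inline Split split_matrix(int n, int tile) {
    Split s;
    s.edge = n % tile;      // what is left after the whole tiles
    s.n_even = n - s.edge;  // size of the tiled square part
    return s;
}
```

With n = 67 and a tile size of 32 this gives n_even = 64 and edge = 3, matching the example above; with n = 1025 it gives n_even = 1024 and edge = 1.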
My problem is with the rectangular part. The code below shows the main program with the defined values:
    const int TILE_DIM = 32;
    const int BLOCK_ROWS = 8;
    const int NUM_REPS = 100;
    const int Nx = 2024; // size of the matrix
    const int Ny = 2024;
    const int Edge = Nx % TILE_DIM; // width of the leftover strip (assumed definition)
    const int nEven = Nx - Edge;    // size of the tiled part (assumed definition)

    int main(int argc, char **argv)
    {
        const int nx = Nx;
        const int ny = Ny; // size of the arrays
        const int mem_size = nx * ny * sizeof(int); // size of the original array in bytes
        int *h_idata = (int*)malloc(mem_size); // original host array
        int *d_idata; // device array
        checkCuda(cudaMalloc(&d_idata, mem_size));
        // (host data initialization and the copy to the device are omitted)

        dim3 dimGridX(nx / TILE_DIM, 1, 1); // grid dimensions
        dim3 dimBlockX(TILE_DIM, 1, 1);     // threads per block

        cudaEvent_t startEvent, stopEvent;
        float ms;
        cudaEventCreate(&startEvent);
        cudaEventCreate(&stopEvent);

        cudaEventRecord(startEvent, 0);
        // kernel for the rectangular part only
        EdgeTransposeX<<<dimGridX, dimBlockX>>>(d_idata);
        cudaEventRecord(stopEvent, 0);
        cudaEventSynchronize(stopEvent);
        cudaEventElapsedTime(&ms, startEvent, stopEvent);

        cudaMemcpy(h_idata, d_idata, mem_size, cudaMemcpyDeviceToHost);
    }
The kernel code is below. I was advised not to use shared memory, so this is how I've done it:
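    __global__ void EdgeTransposeX(int *idata)
    {
        // per-thread scratch copies of the edge rows and edge columns
        int tile_C[Edge][Nx];
        int tile_V[Nx][Edge];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;

        // only the thread at the last tiled column does the work
        if (x == (nEven - 1))
        {
            // read phase: edge columns go into tile_V, edge rows into tile_C
            for (int j = 0; j < Nx; j++)
                for (int i = 1; i <= Edge; i++)
                {
                    tile_V[j][i - 1] = idata[j * Nx + (x + i)];
                    tile_C[i - 1][j] = idata[(x + i) * Nx + j];
                }

            __syncthreads();

            // write phase: store the old rows as columns and the old columns as rows
            for (int j = 0; j < Nx; j++)
                for (int i = 1; i <= Edge; i++)
                {
                    idata[j * Nx + (x + i)] = tile_C[i - 1][j];
                    idata[(x + i) * Nx + j] = tile_V[j][i - 1];
                }
        }
    }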
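For comparison, the swap the kernel is meant to perform can be sketched on the host, which may be useful for checking the GPU output on small matrices. This is only a reference sketch: `edge_transpose_host` is a hypothetical name, and it assumes a row-major n x n matrix where `n_even` is the size of the tiled part.

```cpp
#include <utility>  // std::swap
#include <vector>

// Host-side reference (a sketch, not the kernel itself): swaps every element
// in columns [n_even, n) with its mirror element in rows [n_even, n) of a
// row-major n x n matrix. Each mirrored pair is swapped exactly once.
inline void edge_transpose_host(std::vector<int>& a, int n, int n_even) {
    for (int c = n_even; c < n; ++c)      // edge column / edge row index
        for (int j = 0; j < c; ++j)       // j < c avoids swapping a pair twice
            std::swap(a[j * n + c], a[c * n + j]);
}
```

Elements with both indices below `n_even` are left untouched, since they belong to the tiled part.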
The kernel works fine until the matrix size reaches 1025; beyond that it stops working. Any idea why? Am I missing something here?