
I have a multi-dimensional array stored in device memory. I want to "permute"/"transpose" it, that is, re-arrange its elements according to a new order of dimensions.

For example, if I have a 2D array

A = [0, 1, 2
     3, 4, 5]

I want to change the order of dimension so I get

B = [0, 3
     1, 4
     2, 5]

This re-ordering effectively takes the elements stored in memory in the order [0,1,2,3,4,5] and copies them into a new ordering [0,3,1,4,2,5].

I know how to map the indices from A to B; my question is how I can execute this mapping efficiently on the device using CUDA?
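For concreteness, the index mapping described above could be executed naively with one thread per element, where each thread computes its source index from its destination index (a hedged sketch; the kernel name and shape parameters are mine, not from any library):

```cuda
// Naive sketch of the 2D case from the question: A is rows x cols (row-major),
// B is cols x rows. One thread per element of A; bounds check handles
// matrices that are not multiples of the block size.
__global__ void permute2D(float *B, const float *A, int rows, int cols)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index in A
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index in A
    if (i < rows && j < cols)
        B[j * rows + i] = A[i * cols + j];          // [0,1,2,3,4,5] -> [0,3,1,4,2,5]
}
```

This is correct but not optimal: either the reads or the writes end up strided, which is what the coalesced approaches in the answer below address.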

Shai
  • You could just leave the memory as is and use the alternative indexing scheme. That is what every sane linear algebra library does. – talonmies Dec 23 '15 at 14:59
  • @talonmies I wish I could... but it's part of a bigger library and the fancy indexing is not supported – Shai Dec 23 '15 at 15:36

1 Answer


You could check this post on efficient matrix transpose in CUDA: http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/

Naive Matrix Transpose:

const int TILE_DIM = 32;    // tile width; values used in the linked post
const int BLOCK_ROWS = 8;   // rows each thread handles per loop iteration

__global__ void transposeNaive(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;  // assumes a square, tile-aligned matrix

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[x*width + (y+j)] = idata[(y+j)*width + x];
}
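These kernels assume a launch configuration along the following lines (a sketch; the matrix dimensions and device pointer names are assumptions, chosen to match the linked post's setup):

```cuda
// Hedged sketch of a matching launch, assuming a square matrix whose side
// is a multiple of TILE_DIM, with d_idata/d_odata already allocated on device.
const int nx = 1024, ny = 1024;              // assumed matrix dimensions
dim3 dimGrid(nx / TILE_DIM, ny / TILE_DIM, 1);
dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);      // 32x8 = 256 threads per block
transposeNaive<<<dimGrid, dimBlock>>>(d_odata, d_idata);
```

Each block thus covers one TILE_DIM x TILE_DIM tile, with each thread writing TILE_DIM/BLOCK_ROWS elements.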

Coalesced Transpose Via Shared Memory: [diagram from the linked post: tiled reads from `idata`, transpose through a shared-memory tile, tiled writes to `odata`]

__global__ void transposeCoalesced(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
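The same post goes one step further: the column-wise reads from the shared-memory tile above cause bank conflicts, which can be avoided by padding the tile with one extra column.

```cuda
// From the linked post: identical to transposeCoalesced except the tile is
// padded by one column, so threads in a warp reading a tile *column* fall
// into different shared-memory banks.
__global__ void transposeNoBankConflicts(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM+1];  // +1 column of padding

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
```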
Humam Helfawi
  • Thank you for your answer. I'm a GPU newbie so it might take me a while to parse. But in the attached diagram, shouldn't the green column be vertical in `odata`? – Shai Dec 23 '15 at 18:01
  • 7
    The objective pictured in the diagram is to have coalesced access for both the *reads* from `idata` and the *writes* to `odata`. In a nutshell, coalescing in this context will be achieved by "horizontal" access. Therefore the orientation of the yellow operation (read from `idata`) and the green operation (write to `odata`) are intended to both have "horizontal" access as they touch the global data arrays. The actual in-tile transposing is occurring in shared memory, which is depicted by the tile block in the center of the diagram. This is where the reads and writes are horizontal/vertical. – Robert Crovella Dec 23 '15 at 18:45