
I have a multi-dimensional array stored in device memory. I want to "permute"/"transpose" it, that is, re-arrange its elements according to a new order of dimensions.

For example, if I have a 2D array

A = [0, 1, 2
     3, 4, 5]

I want to change the order of dimension so I get

B = [0, 3
     1, 4
     2, 5]

This re-ordering effectively takes the elements stored in memory in the order [0,1,2,3,4,5] and copies them into a new ordering [0,3,1,4,2,5].

I know how to map the indices from A to B; my question is how I can execute this mapping efficiently on the device using CUDA?
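For concreteness, the index mapping described above could be executed naively with one thread per element, where each thread computes its source index from its destination index (a hedged sketch; the kernel name and shape parameters are mine, not from any library):

```cuda
// Naive sketch of the 2D case from the question: A is rows x cols (row-major),
// B is cols x rows. One thread per element of A; bounds check handles
// matrices that are not multiples of the block size.
__global__ void permute2D(float *B, const float *A, int rows, int cols)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index in A
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index in A
    if (i < rows && j < cols)
        B[j * rows + i] = A[i * cols + j];          // [0,1,2,3,4,5] -> [0,3,1,4,2,5]
}
```

This is correct but not optimal: either the reads or the writes end up strided, which is what the coalesced approaches in the answer below address.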

Shai
  • You could just leave the memory as is and use the alternative indexing scheme. That is what every sane linear algebra library does. – talonmies Dec 23 '15 at 14:59
  • @talonmies I wish I could... but it's part of a bigger library and the fancy indexing is not supported – Shai Dec 23 '15 at 15:36

1 Answer


You could check this post on efficient matrix transpose in CUDA: http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/

Naive Matrix Transpose:

const int TILE_DIM = 32;    // tile width; values used in the linked post
const int BLOCK_ROWS = 8;   // rows each thread handles per loop iteration

__global__ void transposeNaive(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;  // assumes a square, tile-aligned matrix

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[x*width + (y+j)] = idata[(y+j)*width + x];
}
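These kernels assume a launch configuration along the following lines (a sketch; the matrix dimensions and device pointer names are assumptions, chosen to match the linked post's setup):

```cuda
// Hedged sketch of a matching launch, assuming a square matrix whose side
// is a multiple of TILE_DIM, with d_idata/d_odata already allocated on device.
const int nx = 1024, ny = 1024;              // assumed matrix dimensions
dim3 dimGrid(nx / TILE_DIM, ny / TILE_DIM, 1);
dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);      // 32x8 = 256 threads per block
transposeNaive<<<dimGrid, dimBlock>>>(d_odata, d_idata);
```

Each block thus covers one TILE_DIM x TILE_DIM tile, with each thread writing TILE_DIM/BLOCK_ROWS elements.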

Coalesced Transpose Via Shared Memory: [diagram from the linked post: tiled reads from `idata`, transpose through a shared-memory tile, tiled writes to `odata`]

__global__ void transposeCoalesced(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
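The same post goes one step further: the column-wise reads from the shared-memory tile above cause bank conflicts, which can be avoided by padding the tile with one extra column.

```cuda
// From the linked post: identical to transposeCoalesced except the tile is
// padded by one column, so threads in a warp reading a tile *column* fall
// into different shared-memory banks.
__global__ void transposeNoBankConflicts(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM+1];  // +1 column of padding

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
```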
Humam Helfawi
  • Thank you for your answer. I'm a GPU newbie so it might take me a while to parse. But in the attached diagram, shouldn't the green column be vertical in `odata`? – Shai Dec 23 '15 at 18:01
  • 7
    The objective pictured in the diagram is to have coalesced access for both the *reads* from `idata` and the *writes* to `odata`. In a nutshell, coalescing in this context will be achieved by "horizontal" access. Therefore the orientation of the yellow operation (read from `idata`) and the green operation (write to `odata`) are intended to both have "horizontal" access as they touch the global data arrays. The actual in-tile transposing is occurring in shared memory, which is depicted by the tile block in the center of the diagram. This is where the reads and writes are horizontal/vertical. – Robert Crovella Dec 23 '15 at 18:45