
Say I have a matrix with a dimension of A*B on GPU, where B (number of columns) is the leading dimension assuming a C style. Is there any method in CUDA (or cublas) to transpose this matrix to FORTRAN style, where A (number of rows) becomes the leading dimension?

It would be even better if the matrix could be transposed during the host->device transfer while keeping the original data unchanged.

Hailiang Zhang
  • Because CUBLAS can operate on both transposed and normal matrices, you probably don't need to explicitly calculate the matrix transpose, even when working with matrices which are in row major order. – talonmies Dec 09 '12 at 09:55

4 Answers


As asked in the title, to transpose a device row-major matrix A[m][n], one can do it this way:

    float* clone = ...; // copy the content of A to clone
    float const alpha(1.0);
    float const beta(0.0);
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSgeam( handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, &alpha, clone, n, &beta, clone, m, A, m );
    cublasDestroy(handle);

And to multiply two row-major matrices A[m][k] and B[k][n] into C = A*B:

    cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, B, n, A, k, &beta, C, n );

where C is also a row-major matrix.
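
For reference, here is a minimal self-contained sketch of that row-major multiplication; the sizes and check values below are illustrative and not part of the original answer:

    // Sketch of the row-major multiplication above: C = A*B with
    // row-major A[m][k], B[k][n], C[m][n]. It is computed as column-major
    // C^T = B^T * A^T, i.e. the cublasSgemm call with B passed first.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>
    #include <cstdio>

    int main() {
        const int m = 2, k = 3, n = 4;                 // illustrative sizes
        std::vector<float> A = {1, 2, 3,
                                4, 5, 6};              // m x k, row-major
        std::vector<float> B(k * n, 1.0f);             // k x n, all ones
        std::vector<float> C(m * n, 0.0f);             // m x n result

        float *dA, *dB, *dC;
        cudaMalloc(&dA, A.size() * sizeof(float));
        cudaMalloc(&dB, B.size() * sizeof(float));
        cudaMalloc(&dC, C.size() * sizeof(float));
        cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, m, k, &alpha, dB, n, dA, k, &beta, dC, n);
        cublasDestroy(handle);

        cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
        // With all-ones B, each row of C holds the row sums of A: 6 and 15.
        for (float v : C) printf("%g ", v);
        printf("\n");

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }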

Feng Wang
    Is it possible to do this without a clone? – bge0 Oct 17 '14 at 19:50
  • Apparently not; the documentation mentions "in place" computation, but I didn't find a way to do it without a temporary matrix. Fortunately my matrices are fairly small at this point of the whole process. – Romain Laneuville Mar 27 '20 at 15:29

The CUDA SDK includes a matrix transpose; you can see examples of code here on how to implement one, ranging from a naive implementation up to optimized versions.

For example:

Naïve transpose

__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height, int nreps)
{
    int xIndex = blockIdx.x*TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y*TILE_DIM + threadIdx.y;
    int index_in = xIndex + width * yIndex;
    int index_out = yIndex + height * xIndex;

    for (int r=0; r < nreps; r++)
    {
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS)
        {
          odata[index_out+i] = idata[index_in+i*width];
        }
    }
}
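
How the kernel is launched depends on two compile-time constants from the sample (TILE_DIM and BLOCK_ROWS, e.g. 32 and 8). Here is a sketch of a host-side wrapper, assuming those values and that width and height are multiples of TILE_DIM; the wrapper name is illustrative:

// Launch sketch for transposeNaive above. It assumes the sample's constants
// (TILE_DIM = 32, BLOCK_ROWS = 8, defined before the kernel) and that width
// and height are multiples of TILE_DIM.
void launchTransposeNaive(float *d_odata, float *d_idata, int width, int height)
{
    dim3 grid(width / TILE_DIM, height / TILE_DIM);
    dim3 threads(TILE_DIM, BLOCK_ROWS);   // each block covers one TILE_DIM x TILE_DIM tile
    transposeNaive<<<grid, threads>>>(d_odata, d_idata, width, height, 1);  // nreps = 1
    cudaDeviceSynchronize();
}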

As talonmies pointed out, you can specify whether you want to operate on a matrix as transposed or not in the cublas matrix operations, e.g. for cublasDgemm(), where C = a * op(A) * op(B) + b * C. If you want to operate on A as transposed (A^T), you can specify in the parameters whether it is 'N' (normal) or 'T' (transposed), as in the sketch below.
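
For example, a minimal sketch of such a call computing C = A^T * B in the column-major layout cuBLAS expects (the helper and its parameter names are illustrative, not from the original answer):

// Sketch only: C (m x n) = A^T * B, with A stored column-major as k x m
// and B stored column-major as k x n.
#include <cublas_v2.h>

void gemmAtB(cublasHandle_t handle, int m, int n, int k,
             const double *d_A, const double *d_B, double *d_C)
{
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, k,   // lda = k: leading dimension of A as stored (k x m)
                d_B, k,           // ldb = k
                &beta, d_C, m);   // ldc = m
}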

dreamcrash

The version of CUBLAS bundled with the CUDA 5 toolkit contains a BLAS-like method (cublas<t>geam) that can be used to transpose a matrix. It's documented here.

Talal

Here's a working example:

#include "cublas_v2.h"
#include <vector>
#include <iostream>
using std::cout;

void print_matrix(float *data, int rows, int cols) {
    cout << "[";
    for( int row=0; row < rows; row++) {
        cout << "[";
        for( int col=0; col < cols; col++) {
            cout << data[row*cols+col] << ",";
        }
        cout << "]";
    }
    cout << "]";
}

int main() {
    // allocate host vector
    std::vector<float> h_a = {1,2,3,4,5,6,7,8,9,10};
    int nbytes=h_a.size()*sizeof(*h_a.data());
    std::vector<float> h_b(h_a.size());

    // define the number of rows and the number of columns
    int m=2,n=5;

    // allocate device vectors
    float *d_a, *d_b;
    cudaMalloc(&d_a, nbytes);
    cudaMalloc(&d_b, nbytes);

    // copy host vector to device
    cudaMemcpy(d_a,h_a.data(), nbytes, cudaMemcpyHostToDevice);

    // perform a transpose
    {

        float alpha=1;
        float *A=d_a;
        int lda=n;

        float beta=0;
        float *B=NULL;
        int ldb=n;

        float *C=d_b;
        int ldc=m;

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasStatus_t success=cublasSgeam( handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, &alpha, A, lda, &beta, B, ldb, C, ldc);
        if ( success != CUBLAS_STATUS_SUCCESS)
            cout << "\33[31mError: " << success << "\33[0m\n";
        cublasDestroy(handle);
    }

    // copy back to host
    cudaMemcpy(h_b.data(),d_b,nbytes,cudaMemcpyDeviceToHost);

    cout << "origional:  ";
    print_matrix(h_a.data(),m,n);
    cout << "\n";

    cout << "transposed: ";
    print_matrix(h_b.data(),n,m);
    cout << "\n";

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
philn