
For two matrices X and Q, of size 4x3 and 2x3 respectively, which in memory look like

x = [0 1 2 3 4 5 6 7 8 9 10 11]
q = [3 4 5 6 7 8]

I tried to use the cuBLAS routine cublasSgemm, but I couldn't manage to get the expected results.

Since they are stored in row-major order, column-major cuBLAS will interpret them as 3x4 and 3x2, so it seemed to me that

cublasSgemm(cublas_handle,
    CUBLAS_OP_T, CUBLAS_OP_N,
    q_rows_num, x_rows_num, dim,
    &alpha, // 1
    q_device, q_rows_num,
    x, x_rows_num,
    &beta, // 0
    x_q_multiplication, q_rows_num);

where

dim = 3
x_rows_num = 4
q_rows_num = 2

would work, but in that case I got the error

** On entry to SGEMM  parameter number 8 had an illegal value

I also tried shuffling the parameters a bit, but I couldn't find any setup that would work.

So is it possible to multiply them without changing to column-major order?

EDIT:

So I got the expected results with the changes made in this working example:

#include <cublas_v2.h>

#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

int main()
{
    int x_rows_num = 4;
    int q_rows_num = 2;
    int dim = 3;

    int N = x_rows_num*dim;
    int M = q_rows_num*dim;

    float *x, *q, *x_q_multiplication;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&q, M*sizeof(float));
    cudaMallocManaged(&x_q_multiplication, q_rows_num*x_rows_num*sizeof(float));

    for (int i = 0; i < N; i++) x[i] = i*1.0f;
    for (int i = 0; i < M; i++) q[i] = (i + 3)*1.0f;

    float *q_device;
    cudaMallocManaged(&q_device, M*sizeof(float));
    cudaMemcpy(q_device, q, M*sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    float alpha = 1.f;
    float beta = 0.f;
    cublasSgemm(handle,
        CUBLAS_OP_T, CUBLAS_OP_N,
        x_rows_num, q_rows_num, dim,
        &alpha,
        x, dim,
        q, dim,
        &beta,
        x_q_multiplication, x_rows_num);
    cudaDeviceSynchronize();

    for (int i = 0; i < q_rows_num*x_rows_num; i++) std::cout << x_q_multiplication[i] << " ";

    cudaFree(x);
    cudaFree(q);
    cudaFree(x_q_multiplication);
    return 0;
}

However, I'm still not sure why dim became the leading dimension.
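My best guess at the reasoning (I'd appreciate a correction if this is wrong): the buffer x holds a row-major 4x3 matrix, and the very same buffer read column-major with leading dimension 3 (= dim) is the 3x4 matrix X^T; CUBLAS_OP_T then turns it back into X. The same applies to q, which is why dim rather than the row count ends up as the leading dimension. Below is a small stand-alone CPU check (my own sketch, no cuBLAS involved) of what the call above computes:

#include <iostream>

int main()
{
    const int x_rows_num = 4, q_rows_num = 2, dim = 3;
    float x[x_rows_num*dim], q[q_rows_num*dim];
    for (int i = 0; i < x_rows_num*dim; i++) x[i] = i*1.0f;
    for (int i = 0; i < q_rows_num*dim; i++) q[i] = (i + 3)*1.0f;

    // With CUBLAS_OP_T on x (lda = dim) and CUBLAS_OP_N on q (ldb = dim),
    // cublasSgemm computes C = X * Q^T, a 4x2 matrix stored column-major with
    // ldc = x_rows_num -- byte for byte the same buffer as the row-major
    // 2x4 matrix Q * X^T.
    for (int j = 0; j < q_rows_num; j++)
        for (int i = 0; i < x_rows_num; i++)
        {
            float acc = 0.f;
            for (int k = 0; k < dim; k++)
                acc += x[i*dim + k] * q[j*dim + k]; // row-major X(i,k) * Q(j,k)
            std::cout << acc << " "; // same element order as x_q_multiplication
        }
    std::cout << std::endl;
    return 0;
}

Unless I have slipped somewhere, this prints 14 50 86 122 23 86 149 212, i.e. Q * X^T laid out row-major, matching the working example above.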

  • `cublasSgemm(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N, q_rows_num, x_rows_num, dim, &alpha, q_device, dim, x, dim, &beta, x_q_multiplication, q_rows_num);` worked, but I have no idea why; I thought that the leading dim would always be the original number of rows. – zinsek Feb 19 '18 at 23:05
  • @RobertCrovella regarding your first comment, I added an example to the original post with the changes to the leading dimensions. Regarding your second comment, I feel a little offended, because as you can see in the original example (the cublasSgemm call) I wanted to multiply q^t * x, which with cuBLAS's interpretation would be a 2x3 * 3x4 matrix multiplication, but it seems you stopped reading before that. I also thought it was obvious that the expected results would just be the results of that multiplication, in whatever order. – zinsek Feb 20 '18 at 10:34

1 Answer


Your original CUBLAS call:

cublasSgemm(cublas_handle,
    CUBLAS_OP_T, CUBLAS_OP_N,
    q_rows_num, x_rows_num, dim,
    &alpha, // 1
    q_device, q_rows_num,
    x, x_rows_num,
    &beta, // 0
    x_q_multiplication, q_rows_num);

was close to correct. Your interpretation of what the leading dimensions should be was correct. What you got wrong were the Op specifiers (that is also why parameter 8 was flagged: with CUBLAS_OP_T on the first operand, cuBLAS requires lda >= k, i.e. at least dim = 3, whereas q_rows_num = 2 was passed). If both matrices are row-major ordered and the first array needs to be read in its (row-major) transposed order, then the operation should be:

#include <cublas_v2.h>

#include <cstring>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

int main()
{
    int x_rows_num = 4;
    int q_rows_num = 2;
    int dim = 3;

    int N = x_rows_num*dim;
    int M = q_rows_num*dim;

    float x0[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    float q0[6]  = {3, 4, 5, 6, 7, 8 };

    float *x, *q, *x_q_multiplication;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&q, M*sizeof(float));
    cudaMallocManaged(&x_q_multiplication, q_rows_num*x_rows_num*sizeof(float));

    std::memcpy(x, x0,  N*sizeof(float));
    std::memcpy(q, q0,  M*sizeof(float));

    float *q_device;
    cudaMallocManaged(&q_device, M*sizeof(float));
    cudaMemcpy(q_device, q, M*sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    float alpha = 1.f;
    float beta = 0.f;
    cublasSgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_T,
            q_rows_num, x_rows_num, dim,
            &alpha, // 1
            q_device, q_rows_num,
            x, x_rows_num,
            &beta, // 0
            x_q_multiplication, q_rows_num);

    cudaDeviceSynchronize();

    for (int i = 0; i < q_rows_num*x_rows_num; i++) std::cout << x_q_multiplication[i] << " "; std::cout << std::endl;

    cudaFree(x);
    cudaFree(q);
    cudaFree(x_q_multiplication);
    return 0;
}

which does this for me:

$ nvcc -arch=sm_52 cublas_trans.cu -o cublas_trans -lcublas 
$ ./cublas_trans 
76 88 91 106 106 124 121 142 

and which I believe is the correct answer.
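For completeness, if the goal is to keep everything in row-major order throughout, the usual trick is to swap the operand order: column-major cuBLAS then computes C^T = B^T * A^T, and the buffer it writes is exactly the row-major C = A * B. A sketch (rowMajorGemm is my own helper name, not a cuBLAS function):

#include <cublas_v2.h>

// Sketch of the operand-swap trick for a plain product: A is m x k, B is
// k x n, C is m x n, all row-major and contiguous. Column-major cuBLAS sees
// the same buffers as A^T (k x m, ld = k), B^T (n x k, ld = n) and
// C^T (n x m, ld = n); since C^T = B^T * A^T, swapping the operands yields
// the row-major product without any explicit transposition.
void rowMajorGemm(cublasHandle_t handle, int m, int n, int k,
                  const float *A, const float *B, float *C)
{
    const float alpha = 1.f;
    const float beta = 0.f;
    cublasSgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                n, m, k,
                &alpha,
                B, n,   // B^T: n x k, leading dimension n
                A, k,   // A^T: k x m, leading dimension k
                &beta,
                C, n);  // C^T: n x m, leading dimension n
}

The call in the question's edit is the same idea applied to Q * X^T, with CUBLAS_OP_T used to undo the implicit transposition of x.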

Incidentally, Robert Crovella's now-deleted comment, which you say you took offense to, was 100% correct. I suspect he read your original CUBLAS call, as I did, interpreted the arguments, and concluded, as I did and as CUBLAS itself did, that you were trying to multiply a 3x4 matrix by a 3x2 matrix. That is why the invalid argument error was raised.

  • I think the expected result would be something like in this [link](https://www.wolframalpha.com/input/?i=%5B%5B3,4,5%5D,%5B6,7,8%5D%5D*%5B%5B0,1,2%5D,%5B3,4,5%5D,%5B6,7,8%5D,%5B9,10,11%5D%5D%5ET), and I got results like that with the main method I enclosed. Another thing I may have got wrong: in your example I can see that you want to multiply Q*X. Their common dimension is 3, so given that cuBLAS interprets Q as transposed because of the column-/row-major ordering, I think Q should be transposed to cancel out that ordering transform. – zinsek Feb 20 '18 at 16:39
  • The Wolfram Alpha result you linked to is not what is described in your question. What is described in your question is this: https://pastebin.com/LZhL0064 – talonmies Feb 21 '18 at 06:46