5

I am trying to use CUBLAS to sum two big matrices of unknown size. I need fully optimized code (if possible), so I chose not to write the (simple) matrix addition code myself but to use CUBLAS, in particular the cublasSgemm function, which can be used to sum A and C (if B is the identity matrix): *C = alpha\*op(A)\*op(B) + beta\*C*

The problem is that C and C++ store matrices in row-major format, while cublasSgemm is intended (for Fortran compatibility) to work in column-major format. You can specify whether A and B are to be transposed first, but you can NOT indicate that C should be transposed. So I'm unable to complete my matrix addition.

I can't transpose the C matrix myself, because the matrix can be as large as 20000x20000.

Any idea on how to solve please?

talonmies
Marco A.
  • Many thanks to you for the saxpy solution, it was excellent! So it's totally impossible to implement a (performant) version of matrix multiplication with CUBLAS, right? Should I code it myself? – Marco A. Mar 25 '11 at 22:27
  • You solved my problem, thank you Jonathan, thank you so much! – Marco A. Mar 26 '11 at 11:00
  • @Jonathan: Can you put your comments into an answer so we can get this question off the Unanswered list? Thanks! – Bill the Lizard Mar 29 '11 at 13:33
  • No problem; I just cut-and-pasted my comments above into an answer. Done! – Jonathan Dursi Mar 29 '11 at 23:46

2 Answers

8

cublas&lt;t&gt;geam (e.g. cublasSgeam for single precision) has been added in CUBLAS 5.0. It computes the weighted sum of two optionally transposed matrices: C = alpha*op(A) + beta*op(B).

6

If you're just adding the matrices, it doesn't actually matter. You give it alpha, Aij, beta, and Cij. It thinks you're giving it alpha, Aji, beta, and Cji, and gives you what it thinks is Cji = beta Cji + alpha Aji. But that's the correct Cij as far as you're concerned. My worry is when you start going to things which do matter -- like matrix products. There, there's likely no working around it.

But more to the point, you don't want to be using GEMM to do matrix addition -- you're doing a completely pointless matrix multiplication (which takes ~20,000^3 operations and many passes through memory) for an operation which should only require ~20,000^2 operations and a single pass! Treat the matrices as 20,000^2-long vectors and use saxpy.

Matrix multiplication is memory-bandwidth intensive, so there is a huge (factors of 10x or 100x) difference in performance between coding it yourself and a tuned version. Ideally, you'd change the structures in your code to match the library. If you can't, in this case you can manage just by using linear algebra identities. The C-vs-Fortran ordering means that when you pass in A, CUBLAS "sees" A^T (A transpose). Which is fine; we can work around it. If what you want is C=A.B, pass in the matrices in the opposite order, B.A. Then the library sees (B^T . A^T), and calculates C^T = (A.B)^T; and when it passes back C^T, you get (in your ordering) C. Test it and see.

Jonathan Dursi