I have a very large, very sparse least-squares design matrix A that I would like to multiply by its own transpose: N = A^T * A, where both A and N are stored in CSR format. A has many more rows than columns.

I normally form N directly, row by row, but with CSR storage I would first have to build a graph (a symbolic phase) to determine which elements of N are non-zero. I could do this (I even have some old C code), but I was hoping to reach a solution with less development effort.

I am using CUDA, so this could be done on either the GPU or the CPU, and I can see advantages to doing it on the GPU. I have sketched out an algorithm, but was hoping this problem had already been solved. I could not find anything in the CUDA toolkit other than the direct QR solver for A * x = l (where A is m x n), and Google was not much help either.
I am using C++.
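For concreteness, here is roughly the kind of thing I have in mind, as a single-threaded CPU sketch (the struct and function names are just mine): transpose A with a counting sort, then build each row of N with a dense per-row accumulator (Gustavson's scheme), which sidesteps the explicit symbolic graph.

```cpp
#include <algorithm>
#include <vector>

// Minimal CSR container (illustrative, not a real library type).
struct Csr {
    int rows = 0, cols = 0;
    std::vector<int> row_ptr, col_idx;  // row_ptr has rows + 1 entries
    std::vector<double> vals;
};

// Transpose a CSR matrix via counting sort: O(nnz + rows + cols).
Csr transpose(const Csr& A) {
    Csr T;
    T.rows = A.cols;
    T.cols = A.rows;
    T.row_ptr.assign(A.cols + 1, 0);
    for (int c : A.col_idx) ++T.row_ptr[c + 1];      // count entries per column
    for (int i = 0; i < A.cols; ++i)                  // prefix sum -> row offsets
        T.row_ptr[i + 1] += T.row_ptr[i];
    T.col_idx.resize(A.col_idx.size());
    T.vals.resize(A.vals.size());
    std::vector<int> next(T.row_ptr.begin(), T.row_ptr.end() - 1);
    for (int r = 0; r < A.rows; ++r)
        for (int p = A.row_ptr[r]; p < A.row_ptr[r + 1]; ++p) {
            int dst = next[A.col_idx[p]]++;           // scatter into transposed row
            T.col_idx[dst] = r;
            T.vals[dst] = A.vals[p];
        }
    return T;
}

// Gustavson-style SpGEMM, C = B * A: row i of C is the sum of rows of A
// scaled by the entries of row i of B, gathered in a dense accumulator.
Csr spgemm(const Csr& B, const Csr& A) {
    Csr C;
    C.rows = B.rows;
    C.cols = A.cols;
    C.row_ptr.assign(C.rows + 1, 0);
    std::vector<double> acc(A.cols, 0.0);  // dense accumulator for one row
    std::vector<char> used(A.cols, 0);     // which accumulator slots are live
    std::vector<int> touched;              // their column indices
    for (int i = 0; i < B.rows; ++i) {
        touched.clear();
        for (int p = B.row_ptr[i]; p < B.row_ptr[i + 1]; ++p) {
            int k = B.col_idx[p];
            double v = B.vals[p];
            for (int q = A.row_ptr[k]; q < A.row_ptr[k + 1]; ++q) {
                int j = A.col_idx[q];
                if (!used[j]) { used[j] = 1; touched.push_back(j); }
                acc[j] += v * A.vals[q];
            }
        }
        std::sort(touched.begin(), touched.end());    // keep columns ordered
        for (int j : touched) {
            C.col_idx.push_back(j);
            C.vals.push_back(acc[j]);
            acc[j] = 0.0;                             // reset for the next row
            used[j] = 0;
        }
        C.row_ptr[i + 1] = C.row_ptr[i] + static_cast<int>(touched.size());
    }
    return C;
}

// N = A^T * A is then spgemm(transpose(A), A).
```

Since N is symmetric, half of this work is redundant; that, plus parallelizing the accumulator per row, is essentially what I was hoping an existing GPU library would handle for me.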
Does anyone have any experience here?