-1

I have a very large, very sparse least-squares design matrix (A), which I would like to multiply by itself, as follows: N = A^T * A, where A & N are stored in CSR format. Obviously, A has more rows than columns. I normally form N directly row-by-row, but in the case of CSR, I would have to form a graph first, in order to determine which elements of N are non-zero. I could do this (and even have some old c-code), but I am hoping to get to a solution with less development. I am using CUDA, so this could be done on either the GPU or CPU, where I could see advantages of using the GPU. I have sketched out an algorithm, but was hoping that this problem had already been solved. I could not find anything in the CUDA toolkit, other than the direct A * x = l QR solver (where A=(m,n)). Google was also not very helpful.

I am using C++.

Does anyone have any experience here?

Dar Cos
  • 87
  • 1
  • 1
  • 3
  • It is not obvious to me that A has more rows than columns. Nevertheless, I have sketched an algorithm in my answer below. – tera Mar 10 '19 at 16:31

2 Answers2

3

Ordering of a general COO sparse matrix into CSR/CSC format, and specifically transposition / conversion between CSR and CSC formats are relatively cheap operations and readily available in the cuSPARSE library.

After conversion of your matrix A from CSR format to CSC, you can readily apply the trivial algorithm to compute N = A^T * A.
This can also easily be parallelised with CUDA by having each thread process one column of A to generate one output.

tera
  • 7,080
  • 1
  • 21
  • 32
  • Thanks. I just worked out that my A matrix will generally be < 50MB, so my concerns about memory usage are unfounded. The GPU probably will not even be much faster than the CPU, but I am using it for other parts, so it is really the easiest way of going forward. Hence, this approach results in far much less code than doing the CSR' x CSR multiplication, which I had sketched out, which only makes sense if the number of non-zero elements per row is small (which is true in my case). Thanks. – Dar Cos Mar 10 '19 at 18:41
1

just noticed that cuSparse in the CUDA toolkit actually has a csr-gemm, which supports transpose on either matrix. I don't know how I overlooked this. See https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-lt-t-gt-csrgemm. Looks like the simplest solution...

Dar Cos
  • 87
  • 1
  • 1
  • 3