I have a very large, very sparse least-squares design matrix A that I would like to multiply by its own transpose: N = A^T * A, where both A and N are stored in CSR format. A has many more rows than columns.

I normally form N directly, row by row, but with CSR storage I would first have to build a graph (a symbolic phase) to determine which elements of N are non-zero. I could do this (I even have some old C code), but I was hoping to reach a solution with less development effort.

I am using CUDA, so this could be done on either the GPU or the CPU, and I can see advantages to doing it on the GPU. I have sketched out an algorithm, but was hoping this problem had already been solved. I could not find anything in the CUDA toolkit other than the direct QR solver for A * x = l (where A is m x n), and Google was not much help either.
I am using C++.
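For concreteness, here is roughly the kind of thing I have in mind, as a single-threaded CPU sketch (the struct and function names are just mine): transpose A with a counting sort, then build each row of N with a dense per-row accumulator (Gustavson's scheme), which sidesteps the explicit symbolic graph.

```cpp
#include <algorithm>
#include <vector>

// Minimal CSR container (illustrative, not a real library type).
struct Csr {
    int rows = 0, cols = 0;
    std::vector<int> row_ptr, col_idx;  // row_ptr has rows + 1 entries
    std::vector<double> vals;
};

// Transpose a CSR matrix via counting sort: O(nnz + rows + cols).
Csr transpose(const Csr& A) {
    Csr T;
    T.rows = A.cols;
    T.cols = A.rows;
    T.row_ptr.assign(A.cols + 1, 0);
    for (int c : A.col_idx) ++T.row_ptr[c + 1];      // count entries per column
    for (int i = 0; i < A.cols; ++i)                  // prefix sum -> row offsets
        T.row_ptr[i + 1] += T.row_ptr[i];
    T.col_idx.resize(A.col_idx.size());
    T.vals.resize(A.vals.size());
    std::vector<int> next(T.row_ptr.begin(), T.row_ptr.end() - 1);
    for (int r = 0; r < A.rows; ++r)
        for (int p = A.row_ptr[r]; p < A.row_ptr[r + 1]; ++p) {
            int dst = next[A.col_idx[p]]++;           // scatter into transposed row
            T.col_idx[dst] = r;
            T.vals[dst] = A.vals[p];
        }
    return T;
}

// Gustavson-style SpGEMM, C = B * A: row i of C is the sum of rows of A
// scaled by the entries of row i of B, gathered in a dense accumulator.
Csr spgemm(const Csr& B, const Csr& A) {
    Csr C;
    C.rows = B.rows;
    C.cols = A.cols;
    C.row_ptr.assign(C.rows + 1, 0);
    std::vector<double> acc(A.cols, 0.0);  // dense accumulator for one row
    std::vector<char> used(A.cols, 0);     // which accumulator slots are live
    std::vector<int> touched;              // their column indices
    for (int i = 0; i < B.rows; ++i) {
        touched.clear();
        for (int p = B.row_ptr[i]; p < B.row_ptr[i + 1]; ++p) {
            int k = B.col_idx[p];
            double v = B.vals[p];
            for (int q = A.row_ptr[k]; q < A.row_ptr[k + 1]; ++q) {
                int j = A.col_idx[q];
                if (!used[j]) { used[j] = 1; touched.push_back(j); }
                acc[j] += v * A.vals[q];
            }
        }
        std::sort(touched.begin(), touched.end());    // keep columns ordered
        for (int j : touched) {
            C.col_idx.push_back(j);
            C.vals.push_back(acc[j]);
            acc[j] = 0.0;                             // reset for the next row
            used[j] = 0;
        }
        C.row_ptr[i + 1] = C.row_ptr[i] + static_cast<int>(touched.size());
    }
    return C;
}

// N = A^T * A is then spgemm(transpose(A), A).
```

Since N is symmetric, half of this work is redundant; that, plus parallelizing the accumulator per row, is essentially what I was hoping an existing GPU library would handle for me.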
Does anyone have any experience here?