
I would like to use CHOLMOD's GPU acceleration, and have found several simple examples of how to use the library for Cholesky decomposition. However, all of the examples provide the matrices to CHOLMOD in host memory and allow it to copy them to the device. The project I'm working on already has these matrices resident in device memory, as they were built in parallel, and more processing will be performed on the GPU after the Cholesky decomposition.
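
For reference, the examples I've found follow roughly this pattern (a minimal sketch with error handling omitted; reading the matrix from stdin is just a stand-in for however the matrix is actually built):

```c
/* Sketch of the usual host-memory CHOLMOD GPU example: the matrix lives in
 * host memory and CHOLMOD copies what it needs to the device internally. */
#include <stdio.h>
#include <cholmod.h>

int main(void)
{
    cholmod_common c;
    cholmod_l_start(&c);   /* GPU path needs the long-integer (cholmod_l_*) API */
    c.useGPU = 1;          /* ask for GPU acceleration (supernodal factorization) */

    /* Host-memory sparse matrix, e.g. read from a Matrix Market file. */
    cholmod_sparse *A = cholmod_l_read_sparse(stdin, &c);

    cholmod_factor *L = cholmod_l_analyze(A, &c);   /* ordering/analysis on the host */
    cholmod_l_factorize(A, L, &c);                  /* GPU-accelerated numeric factorization */

    cholmod_dense *b = cholmod_l_ones(A->nrow, 1, A->xtype, &c);
    cholmod_dense *x = cholmod_l_solve(CHOLMOD_A, L, b, &c);

    cholmod_l_free_dense(&x, &c);
    cholmod_l_free_dense(&b, &c);
    cholmod_l_free_factor(&L, &c);
    cholmod_l_free_sparse(&A, &c);
    cholmod_l_finish(&c);
    return 0;
}
```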

My question is: Is it possible to interface CHOLMOD directly with device memory, so that I can avoid copying the matrices to host memory only for CHOLMOD to copy them straight back to the device?

Apologies if this is not the correct place to ask this question; if someone can point me to a users' forum, that would be great too.

nitronoid
  • I believe the answer is no. Cholmod (as I recall) still performs a lot of operations on the host and uses asynchronous CUBLAS calls to accelerate only part of the factorization process – talonmies Mar 25 '19 at 05:48
  • @talonmies hm so that would make it impossible to avoid copying back to the host side, but maybe there's a way to get around copying it to the device again? I don't mind copying back to the host as that could be asynchronous but copying to the device again seems wasteful. Thanks for your reply – nitronoid Mar 25 '19 at 11:19
  • The sparse matrix is never copied to the device in a code like Cholmod. Dense submatrices resulting from the factorization are copied to the device, processed, and copied back to the host asynchronously while the host CPU simultaneously performs other operations. There is not (or certainly used not to be) a single line of CUDA code in the Cholmod tree, just CUBLAS calls to perform GEMM and rank-k updates on dense blocks of the sparse matrix – talonmies Mar 25 '19 at 11:27
  • @talonmies I see, thank you. Do you have any recommendations for CUDA-based solvers? The performance of the CG solvers I tried on the GPU is orders of magnitude slower than cholmod. I also had a look at the NVIDIA samples, and the linear solver demo performs reordering on the CPU to compete. (Sparse positive definite, ~100k square) – nitronoid Mar 25 '19 at 11:42
  • It's not documented (it's labelled "experimental"), but there is a device API for the cusolver library; see [here](https://devtalk.nvidia.com/default/topic/1048452/gpu-accelerated-libraries/sparse-cusolver-inside-loop-factorization-at-every-call-/) or [here](https://devtalk.nvidia.com/default/topic/1046216/gpu-accelerated-libraries/solving-linear-system-with-cusparse/) (a sketch of that low-level API is included after these comments) – Robert Crovella Mar 27 '19 at 01:50
  • @RobertCrovella This is pretty great. Is there anything similar for performing reordering or matrix permutation on the GPU? For some of my test matrices I can't get past the LLT factorization, and I believe it's due to not reordering my matrix. For the linear solver demo in the samples, using METIS reordering drastically reduces the computation time. – nitronoid Mar 27 '19 at 13:04
  • @RobertCrovella I have opened a new question for this. https://stackoverflow.com/questions/55382679/perform-matrix-reordering-on-device-cuda – nitronoid Mar 27 '19 at 16:57
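
For anyone finding this later, the undocumented cusolverSp "low-level preview" Cholesky path from the comments above looks roughly like this when the CSR matrix already lives on the device. This is a sketch based on `cusolverSp_LOWLEVEL_PREVIEW.h`; the `d_*` names are placeholders, error checking is omitted, and note that this path does not reorder the matrix (which is what the follow-up question is about):

```c
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cusolverSp.h>
#include <cusolverSp_LOWLEVEL_PREVIEW.h>   /* "experimental" csrchol API */

/* Factor and solve A*x = b with device-resident CSR data
 * (d_rowptr, d_colind, d_vals, d_b, d_x are device pointers). */
void device_cholesky(int n, int nnz,
                     const int *d_rowptr, const int *d_colind,
                     const double *d_vals,
                     const double *d_b, double *d_x)
{
    cusolverSpHandle_t handle;
    cusolverSpCreate(&handle);

    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    csrcholInfo_t info;
    cusolverSpCreateCsrcholInfo(&info);

    /* Symbolic analysis, workspace query, numeric factorization, solve. */
    cusolverSpXcsrcholAnalysis(handle, n, nnz, descr, d_rowptr, d_colind, info);

    size_t internal_bytes = 0, work_bytes = 0;
    cusolverSpDcsrcholBufferInfo(handle, n, nnz, descr, d_vals, d_rowptr,
                                 d_colind, info, &internal_bytes, &work_bytes);

    void *d_work = NULL;
    cudaMalloc(&d_work, work_bytes);

    cusolverSpDcsrcholFactor(handle, n, nnz, descr, d_vals, d_rowptr,
                             d_colind, info, d_work);
    cusolverSpDcsrcholSolve(handle, n, d_b, d_x, info, d_work);

    cudaFree(d_work);
    cusolverSpDestroyCsrcholInfo(info);
    cusparseDestroyMatDescr(descr);
    cusolverSpDestroy(handle);
}
```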

1 Answer


No. CHOLMOD only performs part of the factorization on the GPU: a host copy of the matrix is required for reordering and for the other stages of the factorization, and only dense blocks are shipped to the device during the supernodal phase.
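
To make the required data flow concrete, here is a sketch of the unavoidable round trip: the device-resident CSC arrays (hypothetical names `d_colptr`, `d_rowind`, `d_vals`) are copied back to the host into a `cholmod_sparse`, and CHOLMOD then ships dense blocks to the GPU itself during factorization. It assumes the device index arrays are already 64-bit, since the GPU path uses the `cholmod_l_*` (SuiteSparse_long) interface:

```c
#include <stdint.h>
#include <cholmod.h>
#include <cuda_runtime.h>

/* Wrap device-resident CSC data in a host-side CHOLMOD matrix and factorize. */
cholmod_factor *factor_from_device(int64_t n, int64_t nnz,
                                   const int64_t *d_colptr,
                                   const int64_t *d_rowind,
                                   const double  *d_vals,
                                   cholmod_common *c)
{
    /* Host-side CHOLMOD matrix; stype = -1 means the lower triangle of a
     * symmetric matrix is stored. */
    cholmod_sparse *A = cholmod_l_allocate_sparse(n, n, nnz, 1, 1, -1,
                                                  CHOLMOD_REAL, c);

    /* The unavoidable device-to-host copies. */
    cudaMemcpy(A->p, d_colptr, (n + 1) * sizeof(int64_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(A->i, d_rowind, nnz * sizeof(int64_t),     cudaMemcpyDeviceToHost);
    cudaMemcpy(A->x, d_vals,   nnz * sizeof(double),      cudaMemcpyDeviceToHost);

    /* Ordering/analysis happens on the host; dense supernodal blocks are
     * moved to the GPU internally during factorization when c->useGPU = 1. */
    cholmod_factor *L = cholmod_l_analyze(A, c);
    cholmod_l_factorize(A, L, c);

    cholmod_l_free_sparse(&A, c);
    return L;
}
```

The copies can at least be overlapped with other work, but with CHOLMOD they cannot be eliminated.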

nitronoid