Small matrix inversion on CUDA

Question

I need a bit of advice from you, and I hope it won't take a lot of your time.

So here is my question: I have a small, dense square matrix, with possible sizes 4x4, 8x8, or 16x16, and I want to invert it using CUDA.

The special part of the question is that I have 1024 idle CUDA threads available for this task. I suspect that the most widespread inversion methods, like Gauss-Jordan, won't work well here, because they are only slightly parallel and would use only about 4-16 of the 1024 threads.

But how else can I invert these matrices using all available threads?

Thank you for your attention!

Use `cublasSgetriBatched` and `cublasSgetrfBatched` as outlined [here](http://stackoverflow.com/questions/22887167/cublas-incorrect-inversion-for-matrix-with-zero-pivot). — Robert Crovella, Mar 05 '15 at 22:28
If desired, and you have a device which supports dynamic parallelism, you can even call these from the device as outlined [here](http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device) — Robert Crovella, Mar 05 '15 at 22:34
Interesting idea, thank you. I'm not sure it will fit my project, for other reasons, but I will examine it thoroughly. — Ilya Afanasiev, Mar 05 '15 at 22:37
I'd be surprised if you can find a method that efficiently utilizes more than one thread per matrix element. I'm pretty sure Gauss-Jordan can be crafted to use as much as one thread per matrix element, with some utility, as opposed to one thread per column which seems to be what you are implying. — Robert Crovella, Mar 05 '15 at 23:08
@IlyaAfanasiev: You might want to check the CUDA registered developer website. There was source code available for download (BSD license) implementing inverse of small matrices. Used Gauss-Jordan with one element per thread if I recall correctly. It may be listed under "batched solver". Depending on your project that may be easier to incorporate into your code base, otherwise I concur with Robert Crovella's suggestion. — njuffa, Mar 06 '15 at 00:39
@njuffa if you wanted to make an answer around that, I would upvote — Robert Crovella, Mar 06 '15 at 03:02
@Robert Crovella: I am hesitant, because I think that answers basically pointing to 3rd party content (requiring registration to boot) are probably just as ill-received on SO as are questions requesting pointers to such material. — njuffa, Mar 06 '15 at 04:16
But they still completely solve the problem, so it's okay. Here is the link to the batched solver discussion, https://devtalk.nvidia.com/default/topic/503069/batched-solver-code-available/ , in case anybody needs it in the future. — Ilya Afanasiev, Mar 06 '15 at 08:59
@njuffa: I went ahead and added a community wiki entry from your comments. I hope that's OK with you. — talonmies, Nov 01 '15 at 15:38

score 1 · Answer 1 · answered Oct 25 '15 at 06:21

There are at least two possible ready-made options for this sort of problem:

1. Use the batched solvers shipping in recent versions of the cuBLAS library.
2. Use the BSD-licensed Gauss-Jordan elimination device code functions which NVIDIA distribute to registered developers. These were intended to invert small matrices using one thread per matrix.

[This answer was assembled from comments and added as a community wiki entry to get the question off the unanswered queue]
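
For readers wondering how Gauss-Jordan could keep more than one thread per column busy, as the comments suggest, here is a minimal CPU-side sketch in C++. It is not NVIDIA's batched solver code and not a cuBLAS call; the function name `invert_gauss_jordan` and the singularity threshold are assumptions made for illustration. The point is the inner double loop: at each pivot step, every element of the augmented matrix [A | I] is recomputed from old values only, so each (row, column) pair could in principle be handled by its own CUDA thread, with a barrier between steps.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper for illustration (not part of cuBLAS or NVIDIA's code).
// Inverts a row-major n x n matrix by Gauss-Jordan elimination on [A | I].
// Returns false if a pivot is (near) zero, i.e. the matrix looks singular.
bool invert_gauss_jordan(const std::vector<float>& a, int n, std::vector<float>& inv) {
    assert(n > 0 && a.size() == static_cast<std::size_t>(n) * n);
    const int w = 2 * n;  // width of the augmented matrix
    std::vector<float> cur(static_cast<std::size_t>(n) * w, 0.0f);
    std::vector<float> next(cur.size());

    // Build the augmented matrix [A | I].
    for (int r = 0; r < n; ++r) {
        for (int c = 0; c < n; ++c) cur[r * w + c] = a[r * n + c];
        cur[r * w + n + r] = 1.0f;
    }

    for (int k = 0; k < n; ++k) {
        // Partial pivoting: pick the row with the largest |entry| in column k.
        // On a GPU this would be a small reduction inside the thread block.
        int p = k;
        for (int r = k + 1; r < n; ++r)
            if (std::fabs(cur[r * w + k]) > std::fabs(cur[p * w + k])) p = r;
        const float pivot = cur[p * w + k];
        if (std::fabs(pivot) < 1e-12f) return false;  // assumed threshold
        if (p != k)
            for (int c = 0; c < w; ++c) std::swap(cur[k * w + c], cur[p * w + c]);

        // Element-wise update. Every (r, c) reads only old values from `cur`
        // and writes one value to `next`, so the n * 2n updates are
        // independent: this is the work one thread per element would do.
        for (int r = 0; r < n; ++r) {
            for (int c = 0; c < w; ++c) {
                const float scaled = cur[k * w + c] / pivot;
                next[r * w + c] =
                    (r == k) ? scaled : cur[r * w + c] - cur[r * w + k] * scaled;
            }
        }
        cur.swap(next);  // plays the role of a barrier such as __syncthreads()
    }

    // The right half of [I | A^-1] now holds the inverse.
    inv.assign(static_cast<std::size_t>(n) * n, 0.0f);
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) inv[r * n + c] = cur[r * w + n + c];
    return true;
}
```

Calling it with the 2x2 matrix `{4, 7, 2, 6}` and `n = 2` fills `inv` with approximately `{0.6, -0.7, -0.2, 0.4}`. In this formulation a 16x16 matrix has 16 x 32 = 512 independent element updates per pivot step, so even the largest case would not occupy all 1024 threads, which is consistent with the remark above about one thread per matrix element being a practical upper bound.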
