
As part of a larger problem, I need to solve small linear systems (i.e. NxN where N ~ 10), so using the relevant CUDA libraries doesn't make any sense in terms of speed.

Unfortunately, it's also unclear how to go about solving such systems without pulling in the big guns like GSL, Eigen, etc.

Can anyone point me in the direction of a dense matrix solver (Ax=B) in straight C?

For those interested, the basic structure of the generator for this section of code is:

ndarray = some.generator(N, N)      # placeholder for the actual generator
for v in range(N):
    B[v] = _F(v) * constant
    for x in range(N):
        A[v, x] = -_F(v) * ndarray[x, v]

Unfortunately, I have approximately zero knowledge of higher mathematics, so any advice would be appreciated.

UPDATE: I've been working away at this, and have a near-solution that runs but doesn't produce correct results. Anyone lurking is welcome to check out what I've got so far on pastebin.

I'm using Crout decomposition with pivoting, which seems to be the most general approach. The idea for this test is that every thread does the same work. Boring, I know, but the plan is that the matrixcount variable is increased, actual data is put in, and each thread solves the small matrices individually.
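
For reference, the factorisation step I'm talking about looks roughly like this in plain C. This is a minimal sketch, not my actual pastebin code: it assumes row-major float storage, and the names crout_lu and piv are mine.

#include <math.h>

/* In-place Crout LU factorisation with partial pivoting.
   On exit, a[] holds L in the lower triangle (diagonal included) and
   U (implicit unit diagonal) in the strict upper triangle; piv[]
   records the row swaps. Returns 0 on success, -1 if singular. */
static int crout_lu(float *a, int *piv, int n)
{
    for (int j = 0; j < n; j++) {
        /* column j of L: l(i,j) = a(i,j) - sum_k l(i,k) * u(k,j) */
        for (int i = j; i < n; i++) {
            float sum = a[i*n + j];
            for (int k = 0; k < j; k++)
                sum -= a[i*n + k] * a[k*n + j];
            a[i*n + j] = sum;
        }
        /* partial pivot: largest |l(i,j)| on or below the diagonal */
        int p = j;
        for (int i = j + 1; i < n; i++)
            if (fabsf(a[i*n + j]) > fabsf(a[p*n + j]))
                p = i;
        piv[j] = p;
        if (p != j)
            for (int k = 0; k < n; k++) {
                float t = a[j*n + k];
                a[j*n + k] = a[p*n + k];
                a[p*n + k] = t;
            }
        if (a[j*n + j] == 0.0f)
            return -1;  /* singular to working precision */
        /* row j of U: u(j,i) = (a(j,i) - sum_k l(j,k) * u(k,i)) / l(j,j) */
        for (int i = j + 1; i < n; i++) {
            float sum = a[j*n + i];
            for (int k = 0; k < j; k++)
                sum -= a[j*n + k] * a[k*n + i];
            a[j*n + i] = sum / a[j*n + j];
        }
    }
    return 0;
}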

Thanks to everyone who's been checking in on this.

POST-ANSWER UPDATE: I've finished the matrix solving code for CPU and GPU operation; check out my lazy writeup here.

Bolster
  • What sort of dense system are you interested in? – talonmies Apr 13 '11 at 18:32
  • @talonmies hello again! It's a multi-user communications problem where this small section is a quantitative measure of additional bitloading cost. I'm looking at implementing Crout for the time being. Basically I'm looking for something that does Ax=B from within a (py)CUDA environment. – Bolster Apr 13 '11 at 19:07
  • @Andrew Bolster: I was meaning whether the matrix to factorise has any properties or structure you can exploit or which prescribes a particular numerical method for stability reasons. – talonmies Apr 13 '11 at 19:48
  • @Talonmies Updated question, but my little knowledge says 'go with a generic approach'. Maybe you can spot a smarter way. – Bolster Apr 13 '11 at 21:39
  • The good old Numerical Recipes books have whole chapters dedicated to matrix factorization routines. My suggestion is to use the Numerical Recipes in Fortran book as a reference - serial Fortran is often easier to port into device code than something more "fancy". A naive LU factorization (which is precisely what you want) is probably only about 25-30 lines of code. The backward and forward substitution routines are only another 10 each. Template them and let the compiler unroll the loops. – talonmies Apr 14 '11 at 15:22 [a sketch of those substitution routines follows these comments]
  • @talonmies Implemented what I thought was appropriate, but I think I've done something stupid in the implementation. – Bolster Apr 14 '11 at 18:20
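
For anyone else following along, the forward and backward substitution routines talonmies mentions look roughly like this. This is a sketch matching the crout_lu factors and pivot array above; the name crout_solve is mine.

/* Solve A x = b using the factors and pivots from crout_lu().
   b is overwritten with the solution x. */
static void crout_solve(const float *a, const int *piv, float *b, int n)
{
    /* replay the row swaps recorded during factorisation */
    for (int j = 0; j < n; j++) {
        float t = b[j];
        b[j] = b[piv[j]];
        b[piv[j]] = t;
    }
    /* forward substitution, L y = P b (L carries the diagonal) */
    for (int i = 0; i < n; i++) {
        float sum = b[i];
        for (int k = 0; k < i; k++)
            sum -= a[i*n + k] * b[k];
        b[i] = sum / a[i*n + i];
    }
    /* back substitution, U x = y (U has an implicit unit diagonal) */
    for (int i = n - 1; i >= 0; i--) {
        float sum = b[i];
        for (int k = i + 1; k < n; k++)
            sum -= a[i*n + k] * b[k];
        b[i] = sum;
    }
}

Both routines are plain C with no library dependencies, so porting them into a __device__ function for per-thread use should be mechanical.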

1 Answer


CUDA won't help here, that's true. Matrices like that are just too small for it.

What you do to solve a system of linear equations like this is an LU decomposition.

Or, even better, a QR decomposition, using Householder reflections or something like the Gram-Schmidt process.
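
For a system this small, the whole Householder QR solve fits on a page of plain C. Here's a rough sketch under my own naming and a row-major float layout; it's not taken from any particular library.

#include <math.h>

/* Solve A x = b via QR with Householder reflections.
   a[] is n x n row-major and is overwritten with R;
   b is overwritten with the solution x. Returns -1 if singular. */
static int householder_solve(float *a, float *b, int n)
{
    for (int k = 0; k < n - 1; k++) {
        /* Householder vector v = (a(k,k) - alpha, a(k+1,k), ..., a(n-1,k)) */
        float norm = 0.0f;
        for (int i = k; i < n; i++)
            norm += a[i*n + k] * a[i*n + k];
        norm = sqrtf(norm);
        if (norm == 0.0f)
            return -1;
        float alpha = (a[k*n + k] > 0.0f) ? -norm : norm;
        float v0 = a[k*n + k] - alpha;
        float vtv = v0 * v0;
        for (int i = k + 1; i < n; i++)
            vtv += a[i*n + k] * a[i*n + k];
        /* apply H = I - 2 v v^T / (v^T v) to the trailing columns */
        for (int j = k + 1; j < n; j++) {
            float dot = v0 * a[k*n + j];
            for (int i = k + 1; i < n; i++)
                dot += a[i*n + k] * a[i*n + j];
            float s = 2.0f * dot / vtv;
            a[k*n + j] -= s * v0;
            for (int i = k + 1; i < n; i++)
                a[i*n + j] -= s * a[i*n + k];
        }
        /* ... and to the right-hand side */
        float dot = v0 * b[k];
        for (int i = k + 1; i < n; i++)
            dot += a[i*n + k] * b[i];
        float s = 2.0f * dot / vtv;
        b[k] -= s * v0;
        for (int i = k + 1; i < n; i++)
            b[i] -= s * a[i*n + k];
        /* column k of R is (alpha, 0, ..., 0) below row k */
        a[k*n + k] = alpha;
        for (int i = k + 1; i < n; i++)
            a[i*n + k] = 0.0f;
    }
    /* back substitution: R x = Q^T b */
    for (int i = n - 1; i >= 0; i--) {
        float sum = b[i];
        for (int k = i + 1; k < n; k++)
            sum -= a[i*n + k] * b[k];
        if (a[i*n + i] == 0.0f)
            return -1;
        b[i] = sum / a[i*n + i];
    }
    return 0;
}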

Solving the linear equation becomes easy afterwards, but I'm afraid there is always some "higher mathematics" (linear algebra) involved. That said, there are many (many!) C libraries out there for solving linear equations, and they don't seem like "big guns" to me.

Jonas Bötel
  • I'm not looking to gpuify this bit; I want each thread to do its own solving of the small matrix – Bolster Apr 14 '11 at 14:40
  • You'd be better off if each *block* solves its own small matrix. Looping over a thread solving a single matrix is a bad idea. If nothing else, the performance hit taken by asynchronous reads is huge. – Pavan Yalamanchili Apr 15 '11 at 04:48