
What is the best way to utilize OpenMP with a matrix-vector product? Would the for directive suffice (if so, where should I place it? I assume the outer loop would be more efficient), or would I need the schedule clause, etc.?

Also, how would I take advantage of different algorithms to attempt this m-v product most efficiently?

Thanks

1 Answer


The first step you should take is the obvious one: wrap the outermost loop in a parallel for directive, as you assume. It's always worth experimenting a bit to get some evidence to support your (and my) assumptions, but if you were only allowed to make one change, that would be the one to make.

I don't know much about cache-oblivious algorithms, but I understand that they generally work by recursive division of a problem into sub-problems. This doesn't seem to fit with the application of parallel for directives. I suspect you could implement such an algorithm with OpenMP's tasks, but the overhead of doing so would probably outweigh any execution improvements on an m-v product of reasonable dimensions.

(If you demonstrate the falsity of this argument on m-v products of size N I will retort 'N's not a reasonable dimension'. As ever with these performance questions, evidence trumps argument every time.)

Finally, depending on your compiler and the availability of libraries, you may not need to use OpenMP for m-v calculations at all: you might find that auto-parallelisation works efficiently, or you may already have a library implementation which multi-threads this sort of computation.

High Performance Mark
  • Thanks HPM. What about schedule directive or other OpenMP directives? Any other suggestions of OpenMP directives to try out besides for? –  Mar 27 '12 at 07:37
  • 1
    No, not really: for matrix-vector products I'm not convinced that looking beyond parallel for and the schedule clause will be worth your while. But that's my view, based on my experience; in parallel programming you really need to start developing your own views, so for self-education you may want to look at other directives. I'll be interested to see any other responses you get, especially ones which flat-out contradict mine, and provide evidence in support. – High Performance Mark Mar 27 '12 at 07:43
  • HPM, maybe I should have mentioned this, but I plan to run an insanely huge matrix and vector on the order of 100,000 elements... Would this change your answer? –  Mar 27 '12 at 16:26
  • 1
    Not until shown evidence. If you're writing a program to be used again and again then it's worth your while investigating all the options I've suggested and any others you can think of and coming to your own conclusions, based on evidence. If you're writing a program which will not be used repeatedly stop wasting time optimising it, get it running and throw it away when finished. – High Performance Mark Mar 27 '12 at 16:30
  • HPM, I implemented some algorithms and tried different OpenMP directives.. You were right, the overhead is just killing it. It behaved worse than the serial case for some algorithms (even with OpenMP)! Thanks again for the insight! To clarify: using the simple "parallel for" for the outer loop produces the fastest results. –  Mar 28 '12 at 01:36
  • If your aim is high performance, then explicit use of the SSE intrinsics (combined with careful memory alignment to avoid unaligned access) at the inner-most loops is an obvious step which can give you a factor of up to 4. – Walter Mar 31 '12 at 14:25