
I have an application in which I need to carry out a lot of Norms, Dot Products and most importantly, Matrix Vector multiplications.

The matrix and vectors are huge. The matrix dimension tends to be about 100000x100000.

The loop structure is:

while (condition)
    /* usually iterations = dimension of matrix, so around 100000 iterations are *at least* required (if not more) */
    matrix-vector multiplication
    3 dot products
    2 norms

I am currently using Intel Fortran with Intel MKL. Will rewriting my code in Intel C with Intel MKL help at all? Has anyone ever carried out a benchmark of any kind (for DGEMV especially)? Rewriting the code is a major pain, but I would not mind doing it if I see a reason to.

EDIT: I misspoke: The matrix dimensions are 100000 not a million. Pretty serious error :|

And yes, the matrix is dense and it needs to be dense. Moreover, it is not symmetric and not even positive definite. My algorithm is a modified version of QMR.
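
For scale, the storage needed by a dense n x n matrix of doubles is 8 * n * n bytes. The sketch below only works through that arithmetic; `dense_matrix_bytes` is a hypothetical helper name, not part of MKL or BLAS, and it assumes an 8-byte `double`.

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed to store a dense n-by-n matrix of
 * doubles. Assumes sizeof(double) == 8, as on common platforms. */
uint64_t dense_matrix_bytes(uint64_t n)
{
    return n * n * (uint64_t)sizeof(double);
}
```

With n = 100000 this gives 80,000,000,000 bytes (80 GB, roughly 74.5 GiB); with n = 1000000 it gives 8,000,000,000,000 bytes (8 TB).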

  • You are working with 4000 Gb dense matrices? Please do tell more.... – talonmies Jan 05 '12 at 17:33
  • I hope you know that a 1Mx1M matrix of doubles requires 8 TB (8,000 GB) of memory. Are you sure you really need a dense matrix? Your algorithm sounds like a typical iterative linear algebra algorithm that only requires matrix-vector products. I'm pretty sure your matrix has a sparse structure, for which there exist special data structures that are not part of the standard BLAS routines. This would be the first place to look for optimizations, because it will speed up your code from O(n^3) to O(n^2), instead of just giving you some small speedup (if any) gained by switching languages. – Christian Rau Jan 05 '12 at 17:37
  • What is Intel C? You should use ISO C99 or some other standard language. And I write this as an Intel employee, so I am certainly not anti-Intel :-) – Jeff Hammond Jan 04 '15 at 00:54
  • In addition to what the others have said about Fortran vs. C not mattering, you might find that fusing the three dot products is worthwhile. This requires _not_ using BLAS1 routines (_dot) and instead writing loops, but since BLAS1 operations are bandwidth-bound, any decent compiler should be able to do as well as an optimized library. The advantage of fusing the three dot products is that it may better exploit streaming memory bandwidth capability of the processor and/or memory controller. And if there is any data reuse between the three, it will definitely be worthwhile. – Jeff Hammond Jan 04 '15 at 04:01
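
The fusion suggested in the last comment could look roughly like the sketch below: a single hand-written loop that accumulates three dot products in one pass instead of three separate BLAS1 calls. The names `dots3` and `fused_dot3` are hypothetical, and the choice of which vector pairs to combine is an assumption, since the question does not say which three dot products the modified QMR iteration needs.

```c
#include <stddef.h>

/* Hypothetical result type: three dot products computed in one pass. */
typedef struct {
    double d1; /* sum of x1[i] * y1[i] */
    double d2; /* sum of x2[i] * y2[i] */
    double d3; /* sum of x3[i] * y3[i] */
} dots3;

/* One loop over all vectors, so each element is streamed from memory
 * once per pass. Pointers may alias (the same vector may appear in
 * more than one pair), which is why restrict is not used. */
dots3 fused_dot3(size_t n,
                 const double *x1, const double *y1,
                 const double *x2, const double *y2,
                 const double *x3, const double *y3)
{
    double s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s1 += x1[i] * y1[i];
        s2 += x2[i] * y2[i];
        s3 += x3[i] * y3[i];
    }
    dots3 r = { s1, s2, s3 };
    return r;
}
```

If the three products share vectors, for example `fused_dot3(n, u, v, u, w, v, w)`, only three distinct vectors are read rather than six, which is the data-reuse case the comment describes. Whether this actually beats three library calls would need to be measured on the target machine.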

1 Answer


The performance will be identical in C and Fortran, because the implementation backing the library calls is the same and essentially all of the time in your code is spent in those library calls.

Stephen Canon
  • Firstly, if I wasn't completely clear, this is a supercomputing problem and I have profiled the crap out of the code. It turns out that a lot of time is wasted in the OpenMP create/destroy functions and in the matrix-vector multiplication (as expected, since BLAS 1/2 routines aren't as simply parallelizable). It has no barrier or synchronization issues. If C can provide me with inherent shortcuts (if any), I might want to rewrite. –  Jan 05 '12 at 17:33
  • Then you should provide the profiling information and ask for suggestions on that. The original question is just nonsensical; how would calling a library routine from C be faster than calling the same library routine from FORTRAN, especially since the routine was probably (at least at some point) largely written in FORTRAN? How would C "provide shortcuts", exactly? – Jonathan Dursi Jan 05 '12 at 17:56