Alg. MKL Threaded DGEMV

Question

As we all may know, there are many ways to implement DGEMV in parallel (column-wise, block-wise, etc.), each with different communication overheads. I have been looking through both the MKL manual and the BLAS reference manuals to figure out which variant cblas_dgemv from MKL (v. 11) generally uses, without success. If anyone has a reference that documents which algorithm is used, or its overheads, I would be very grateful.

kangshiyin · Answer 1 · 2013-01-14T20:50:16.663

The MKL reference manuals treat DGEMV, like the other routines, as a black box.

But I think there is still a way to estimate the overhead/efficiency.

As we know, DGEMV is a memory-bandwidth-bound operation. For y += A*x you can gauge its speed by the memory bandwidth it achieves:

measure the running time of one DGEMV call as t;
compute the total memory read/write size: m = 2*len(y) + len(x) + len(A) elements, times 8 bytes per double (y is both read and written);
compute the achieved bandwidth bw = m/t;
look up the peak memory bandwidth of the system RAM, bw0;

Then bw/bw0*100% can be seen as the actual efficiency of the algorithm.
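
The steps above can be sketched as a small calculation. This is only an illustration under stated assumptions: the function names are hypothetical, A is taken to be a dense rows-by-cols matrix of 8-byte doubles, and the elapsed time and peak bandwidth are values you supply from your own measurement and hardware specification.

```python
BYTES_PER_DOUBLE = 8  # DGEMV works on double-precision values


def dgemv_traffic_bytes(rows, cols):
    """Bytes moved by y += A*x for a dense rows x cols matrix A."""
    len_a = rows * cols   # A is read once
    len_x = cols          # x is read once
    len_y = rows          # y is read and written, hence the factor 2
    return BYTES_PER_DOUBLE * (2 * len_y + len_x + len_a)


def achieved_bandwidth(rows, cols, seconds):
    """Achieved bandwidth bw = m / t in bytes per second."""
    return dgemv_traffic_bytes(rows, cols) / seconds


def efficiency_percent(rows, cols, seconds, peak_bytes_per_s):
    """bw / bw0 * 100, where bw0 is the peak RAM bandwidth."""
    return 100.0 * achieved_bandwidth(rows, cols, seconds) / peak_bytes_per_s
```

For a large matrix the len(A) term dominates, so the result is mostly a statement about how fast A is streamed from memory.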

Please note that you will want a matrix/vector large enough that it does not fit in cache when doing the measurement. Also, if you repeat the measurement to get a more accurate result, you may need to make the cache cold again before starting each new iteration.
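
One common way to get a cold cache between repetitions is to overwrite a scratch buffer larger than the last-level cache before each timed call. The following is a rough sketch of that idea, not a definitive harness: the names evict_cache and time_cold are hypothetical, fn stands for whatever wraps your DGEMV call, and the 64 MiB default is an assumption that should be raised if your last-level cache is larger.

```python
import time


def evict_cache(scratch):
    """Overwrite a large buffer so previously cached data is likely pushed out."""
    scratch[:] = bytes(len(scratch))


def time_cold(fn, repeats=3, scratch_bytes=64 * 1024 * 1024):
    """Time fn() several times, evicting the cache before each run.

    scratch_bytes is an assumed size; it should exceed the last-level cache.
    Returns the list of wall-clock times in seconds, one per run.
    """
    scratch = bytearray(scratch_bytes)
    times = []
    for _ in range(repeats):
        evict_cache(scratch)             # not included in the timed region
        start = time.perf_counter()
        fn()                             # the call being measured
        times.append(time.perf_counter() - start)
    return times
```

Reporting all the times rather than only the minimum makes it easier to spot a run that was disturbed by other activity on the machine.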
