As many of us know, there are several ways to implement DGEMV in parallel (column-wise, block-wise, etc.), each with a different communication overhead. I have been looking through the MKL documentation and the BLAS reference manuals to figure out which scheme cblas_dgemv from MKL (v11) generally uses, without success. If anyone has a reference that documents which algorithm is used, or what its overheads are, I would be very grateful.
1 Answer
The MKL reference manuals treat DGEMV, like the other routines, as a black box.
But I think there is still a way to estimate the overhead/efficiency.
As we know, DGEMV is a memory-bandwidth-bound operation. For y += A*x you can gauge its speed by the memory bandwidth it achieves:
- measure the running time of one DGEMV call as t;
- compute the total memory read/write size: m = 2*len(y)+len(x)+len(A) (counted in elements; multiply by 8 to get bytes, since the entries are doubles);
- compute the actual bandwidth: bw = m/t;
- look up the peak bandwidth of the system RAM, bw0.

Then bw/bw0*100% can be seen as the actual efficiency of the algorithm.
Please note that you will want a matrix/vector large enough for the measurement to be meaningful. Also, if you repeat the measurement to get a more accurate result, you may need to keep the cache cold before starting each new iteration.
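The arithmetic above can be sketched in a few lines of Python. This is only a rough illustration, not MKL code: the helper names (dgemv_traffic_bytes, efficiency_percent, time_call) and the before_each hook for cooling the cache are hypothetical, and the sketch assumes 8-byte doubles and that you supply the peak bandwidth bw0 yourself, in bytes per second.

```python
import time

BYTES_PER_DOUBLE = 8  # assumption: DGEMV operates on 8-byte doubles


def dgemv_traffic_bytes(rows, cols):
    """Bytes moved by y += A*x, i.e. m = 2*len(y) + len(x) + len(A)."""
    # y is read and written (2 * rows), x is read (cols), A is read (rows * cols).
    elements = 2 * rows + cols + rows * cols
    return elements * BYTES_PER_DOUBLE


def achieved_bandwidth(rows, cols, seconds):
    """Actual bandwidth bw = m / t, in bytes per second."""
    return dgemv_traffic_bytes(rows, cols) / seconds


def efficiency_percent(rows, cols, seconds, peak_bw):
    """bw / bw0 * 100, with peak_bw (bw0) given in bytes per second."""
    return achieved_bandwidth(rows, cols, seconds) / peak_bw * 100.0


def time_call(fn, repeats=5, before_each=None):
    """Best wall-clock time of fn() over several runs.

    before_each is an optional hook (hypothetical) that runs outside the
    timed region, e.g. to touch a large scratch buffer and cool the cache.
    """
    best = float("inf")
    for _ in range(repeats):
        if before_each is not None:
            before_each()
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

To use it, wrap the DGEMV call you want to measure in a zero-argument function, pass it to time_call to get t, and then pass the matrix dimensions, t, and your platform's bw0 to efficiency_percent.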

kangshiyin