I'm testing the performance of DGEMM and SGEMM from multiple libraries on the Apple M1 with a program that does the following: set the dimensions to 1000x1000, call cblas_dgemm with alpha and beta both set to 2, then repeat with dimensions 2000x2000, 3000x3000, and so on. This means that in every iteration, dgemm computes alpha * (AxB) and adds the result to beta * C.
My idea would be to run the big workload (alpha * (AxB)) on a high-performance core and the beta * C scaling on an efficiency core, then join the work of both cores (if the efficiency core takes longer than the performance core, the performance core would start on the next iteration, 2000x2000, in the meantime).
My question is: is there any real way to do this? I'm a bit of a beginner and not sure whether it's doable. Another approach I considered is to divide the workload between the two core types in real time, but Apple doesn't make selecting cores from C particularly easy. Thanks in advance.

Javier
- The main issue is that BLAS libraries are not designed for big-little processors like the Apple M1. They use all the available cores by default, since that is almost always the best solution. Trying to use both core types at the same time (assuming the OS/processor allows this) the way you describe will likely cause a work imbalance that can result in slower execution. The resulting performance is also unlikely to be portable between processors (note that there are already multiple M1 variants). – Jérôme Richard Apr 18 '22 at 17:26
- Besides this, SGEMM for such large matrices is certainly faster on the GPU, especially on the M1, since AFAIK the memory is shared with the GPU, so transfer time should not be an issue compared to usual discrete GPUs. GPUs are far more efficient for this. For DGEMM, the CPU should certainly still be faster. – Jérôme Richard Apr 18 '22 at 17:28