Accelerate uses the M1's AMX coprocessor to perform its matrix operations, so it is not using the typical execution paths in the processor. As such, the accounting of CPU utilization doesn't make much sense; it appears to me that when a CPU core submits instructions to the AMX coprocessor, it is accounted as being held at 100% utilization while it waits for the coprocessor to finish its work.
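For context, the dgesv benchmark referenced throughout this answer boils down to repeated calls to LAPACK's dgesv_ routine, which both Accelerate and OpenBLAS provide. A rough sketch of such a benchmark (the matrix size and repetition count below are placeholders of mine, not the values from the question) might look like this:

/* dgesv_bench.c -- a minimal sketch of the kind of benchmark discussed here.
 * The matrix size and repetition count are placeholder assumptions, not the
 * original values from the question. */
#include <stdio.h>
#include <stdlib.h>

/* LAPACK's double-precision dense solver; provided both by Accelerate
 * (-framework Accelerate) and by OpenBLAS (-lopenblas). */
extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                   int *ipiv, double *b, const int *ldb, int *info);

int main(void) {
    const int n = 4000, nrhs = 1, reps = 10;
    double *a = malloc(sizeof(double) * (size_t)n * n);
    double *b = malloc(sizeof(double) * n);
    int *ipiv = malloc(sizeof(int) * n);

    for (int r = 0; r < reps; r++) {
        /* Fill A and b with pseudo-random data; dgesv_ overwrites both. */
        for (int i = 0; i < n * n; i++) a[i] = (double)rand() / RAND_MAX;
        for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX;

        int info = 0;
        dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
        if (info != 0) {
            fprintf(stderr, "dgesv_ failed: info = %d\n", info);
            return 1;
        }
    }
    free(a); free(b); free(ipiv);
    return 0;
}

The same source file can be linked against either backend, which is what makes the comparisons below possible.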
We can see evidence of this accounting behavior by running multiple instances of your dgesv benchmark in parallel and watching as the runtime balloons (from ~36s to ~59s here), while the CPU monitor simply shows two processes each using 100% of a core:
$ clang -o dgesv_accelerate dgesv_ex.c -framework Accelerate
$ time ./dgesv_accelerate
real 0m36.563s
user 0m36.357s
sys 0m0.251s
$ ./dgesv_accelerate & ./dgesv_accelerate & time wait
[1] 6333
[2] 6334
[1]- Done ./dgesv_accelerate
[2]+ Done ./dgesv_accelerate
real 0m59.435s
user 1m57.821s
sys 0m0.638s
This implies that there is a shared resource that each dgesv_accelerate process is consuming, one that we don't have much visibility into. I was curious whether these dgesv_accelerate processes are actually consuming computational resources at all while waiting for the AMX coprocessor to finish its task, so I linked another version of your example against OpenBLAS, which is what we use as the default BLAS backend in the Julia language. I am using the code hosted in this gist, which has a convenient Makefile for downloading OpenBLAS (and its attendant compiler support libraries such as libgfortran and libgcc), compiling everything, and running timing tests.
Note that because the M1 is a big.LITTLE architecture, we generally want to avoid creating so many threads that we schedule large BLAS operations onto the "efficiency" cores; we mostly want to stick to using only the "performance" cores. You can get a rough picture of what is being used by opening the "CPU History" graph in Activity Monitor. As an example, I watched it under normal system load, then while running OPENBLAS_NUM_THREADS=4 ./dgesv_openblas, and then OPENBLAS_NUM_THREADS=8 ./dgesv_openblas (a programmatic way to set the thread count is sketched just after the timings below). In the four-thread example, the work is properly scheduled onto the performance cores, and the efficiency cores are free to continue doing things such as rendering this StackOverflow webpage as I type this paragraph and playing music in the background. Once I run with 8 threads, however, the music starts to skip, the webpage begins to lag, and the efficiency cores are swamped by a workload they're not designed for. All that, and the timing doesn't even improve much:

$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.76 real 69.67 user 0.73 sys
$ OPENBLAS_NUM_THREADS=8 time ./dgesv_openblas
17.49 real 100.89 user 5.63 sys
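As an aside, if you would rather not rely on the OPENBLAS_NUM_THREADS environment variable, OpenBLAS also lets you set the thread count programmatically via openblas_set_num_threads(), declared in its cblas.h. A minimal sketch (this only affects the OpenBLAS-linked build; the value 4 simply mirrors the run above):

/* openblas_threads.c -- sketch of pinning the OpenBLAS thread pool from C
 * rather than via OPENBLAS_NUM_THREADS. Only meaningful for the OpenBLAS
 * build; Accelerate ignores this entirely. */
#include <cblas.h>   /* OpenBLAS's header declares openblas_set_num_threads() */

int main(void) {
    /* Keep BLAS work on four threads, matching the four Firestorm
     * ("performance") cores used in the timings above. */
    openblas_set_num_threads(4);

    /* ... set up A and b and call dgesv_() here, as in the benchmark
     * sketched earlier ... */
    return 0;
}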
Now that we have two different ways of consuming computational resources on the M1, we can compare and see whether they interfere with each other; e.g. if I launch an Accelerate-powered instance of your example, will it slow down the OpenBLAS-powered instance?
$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.86 real 70.87 user 0.58 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
24.28 real 89.84 user 0.71 sys
So, sadly, it does appear that the CPU usage is real, and that it consumes resources that the OpenBLAS version wants to use. The Accelerate version also gets a little slower, but not by much.
In conclusion, the CPU usage numbers for an Accelerate-heavy process are misleading, but not totally so. There do appear to be CPU resources that Accelerate is using, but there is also a hidden shared resource that multiple Accelerate processes must fight over. Using a non-AMX library such as OpenBLAS results in more familiar performance behavior (and, in this case, a better runtime, although that will not always hold). The truly "optimal" usage of the processor would likely be to have something like OpenBLAS running on 3 Firestorm cores alongside one Accelerate process:
$ OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
23.77 real 68.25 user 0.32 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
28.53 real 81.63 user 0.40 sys
This solves two problems at once, one taking 28.5s and one taking 42.5s (I simply moved the time invocation to measure the dgesv_accelerate run instead). This slowed the 3-core OpenBLAS run down by ~20% and the Accelerate run by ~13%, so if you have an application with a very long queue of these problems to solve, you could feed them to these two engines and solve them in parallel with a modest amount of overhead.
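If you wanted to drive both "engines" from a single parent process, one purely illustrative way is to fork and exec the two binaries and wait for both to finish. Note that handing each child a slice of the work queue via an argument is my own assumption for illustration; the actual binaries above take no arguments:

/* run_both.c -- an illustrative sketch of dispatching work to the two
 * separately-linked solver binaries at once. The "chunk" arguments are an
 * assumption for illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork and exec one solver binary, returning the child's pid. */
static pid_t launch(const char *binary, const char *chunk) {
    pid_t pid = fork();
    if (pid == 0) {
        execl(binary, binary, chunk, (char *)NULL);
        perror("execl");        /* only reached if exec fails */
        _exit(127);
    }
    return pid;
}

int main(void) {
    /* Give the OpenBLAS child 3 performance cores, per the timings above;
     * the variable is harmlessly ignored by the Accelerate child. */
    setenv("OPENBLAS_NUM_THREADS", "3", 1);

    pid_t a = launch("./dgesv_accelerate", "chunk0");
    pid_t b = launch("./dgesv_openblas",   "chunk1");

    int status;
    waitpid(a, &status, 0);
    waitpid(b, &status, 0);
    return 0;
}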
I am not claiming that these configurations are actually optimal, just exploring what the relative overheads are for this particular workload because I am curious. :) There may be ways to improve this, and this all could change dramatically with a new Apple Silicon processor.