Accelerate uses the M1's AMX coprocessor to perform its matrix operations, so it is not using the typical execution paths in the processor. As such, the accounting of CPU utilization doesn't make much sense; it appears to me that when a CPU core submits instructions to the AMX coprocessor, it is accounted as being held at 100% utilization while it waits for the coprocessor to finish its work.
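For context, the dgesv benchmark referenced throughout this answer boils down to repeated calls to LAPACK's dgesv_ routine, which both Accelerate and OpenBLAS provide. A rough sketch of such a benchmark (the matrix size and repetition count below are placeholders of mine, not the values from the question) might look like this:

/* dgesv_bench.c -- a minimal sketch of the kind of benchmark discussed here.
 * The matrix size and repetition count are placeholder assumptions, not the
 * original values from the question. */
#include <stdio.h>
#include <stdlib.h>

/* LAPACK's double-precision dense solver; provided both by Accelerate
 * (-framework Accelerate) and by OpenBLAS (-lopenblas). */
extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                   int *ipiv, double *b, const int *ldb, int *info);

int main(void) {
    const int n = 4000, nrhs = 1, reps = 10;
    double *a = malloc(sizeof(double) * (size_t)n * n);
    double *b = malloc(sizeof(double) * n);
    int *ipiv = malloc(sizeof(int) * n);

    for (int r = 0; r < reps; r++) {
        /* Fill A and b with pseudo-random data; dgesv_ overwrites both. */
        for (int i = 0; i < n * n; i++) a[i] = (double)rand() / RAND_MAX;
        for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX;

        int info = 0;
        dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
        if (info != 0) {
            fprintf(stderr, "dgesv_ failed: info = %d\n", info);
            return 1;
        }
    }
    free(a); free(b); free(ipiv);
    return 0;
}

The same source file can be linked against either backend, which is what makes the comparisons below possible.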
We can see evidence of this accounting behavior by running multiple instances of your dgesv benchmark in parallel and watching as the runtime balloons (from ~36s to ~59s here), while the CPU monitor simply shows two processes each using 100% of a core:
$ clang -o dgesv_accelerate dgesv_ex.c -framework Accelerate
$ time ./dgesv_accelerate
real 0m36.563s
user 0m36.357s
sys 0m0.251s
$ ./dgesv_accelerate & ./dgesv_accelerate & time wait
[1] 6333
[2] 6334
[1]- Done ./dgesv_accelerate
[2]+ Done ./dgesv_accelerate
real 0m59.435s
user 1m57.821s
sys 0m0.638s
This implies that there is a shared resource that each dgesv_accelerate process is consuming, one that we don't have much visibility into. I was curious whether these dgesv_accelerate processes are actually consuming computational resources at all while waiting for the AMX coprocessor to finish its task, so I linked another version of your example against OpenBLAS, which is what we use as the default BLAS backend in the Julia language. I am using the code hosted in this gist, which has a convenient Makefile for downloading OpenBLAS (and its attendant compiler support libraries such as libgfortran and libgcc), compiling everything, and running timing tests.
Note that because the M1 is a big.LITTLE architecture, we generally want to avoid creating so many threads that we schedule large BLAS operations onto the "efficiency" cores; we mostly want to stick to using only the "performance" cores. You can get a rough picture of what is being used by opening the "CPU History" graph in Activity Monitor. As an example, I watched it under normal system load, then while running OPENBLAS_NUM_THREADS=4 ./dgesv_openblas, and then OPENBLAS_NUM_THREADS=8 ./dgesv_openblas (a programmatic way to set the thread count is sketched just after the timings below). In the four-thread example, the work is properly scheduled onto the performance cores, and the efficiency cores are free to continue doing things such as rendering this StackOverflow webpage as I type this paragraph and playing music in the background. Once I run with 8 threads, however, the music starts to skip, the webpage begins to lag, and the efficiency cores are swamped by a workload they're not designed for. All that, and the timing doesn't even improve much:

$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.76 real 69.67 user 0.73 sys
$ OPENBLAS_NUM_THREADS=8 time ./dgesv_openblas
17.49 real 100.89 user 5.63 sys
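As an aside, if you would rather not rely on the OPENBLAS_NUM_THREADS environment variable, OpenBLAS also lets you set the thread count programmatically via openblas_set_num_threads(), declared in its cblas.h. A minimal sketch (this only affects the OpenBLAS-linked build; the value 4 simply mirrors the run above):

/* openblas_threads.c -- sketch of pinning the OpenBLAS thread pool from C
 * rather than via OPENBLAS_NUM_THREADS. Only meaningful for the OpenBLAS
 * build; Accelerate ignores this entirely. */
#include <cblas.h>   /* OpenBLAS's header declares openblas_set_num_threads() */

int main(void) {
    /* Keep BLAS work on four threads, matching the four Firestorm
     * ("performance") cores used in the timings above. */
    openblas_set_num_threads(4);

    /* ... set up A and b and call dgesv_() here, as in the benchmark
     * sketched earlier ... */
    return 0;
}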
Now that we have two different ways of consuming computational resources on the M1, we can compare and see whether they interfere with each other; e.g. if I launch an Accelerate-powered instance of your example, will it slow down the OpenBLAS-powered instance?
$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.86 real 70.87 user 0.58 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
24.28 real 89.84 user 0.71 sys
So, sadly, it does appear that the CPU usage is real, and that it consumes resources that the OpenBLAS version wants to use. The Accelerate version also gets a little slower, but not by much.
In conclusion, the CPU usage numbers for an Accelerate-heavy process are misleading, but not totally so. There do appear to be CPU resources that Accelerate is using, but there is also a hidden shared resource that multiple Accelerate processes must fight over. Using a non-AMX library such as OpenBLAS results in more familiar performance behavior (and, in this case, a better runtime, although that will not always hold). The truly "optimal" usage of the processor would likely be to have something like OpenBLAS running on 3 Firestorm cores alongside one Accelerate process:
$ OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
23.77 real 68.25 user 0.32 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
28.53 real 81.63 user 0.40 sys
This solves two problems at once, one taking 28.5s and one taking 42.5s (I simply moved the time invocation to measure the dgesv_accelerate run instead). This slowed the 3-core OpenBLAS run down by ~20% and the Accelerate run by ~13%, so if you have an application with a very long queue of these problems to solve, you could feed them to these two engines and solve them in parallel with a modest amount of overhead.
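If you wanted to drive both "engines" from a single parent process, one purely illustrative way is to fork and exec the two binaries and wait for both to finish. Note that handing each child a slice of the work queue via an argument is my own assumption for illustration; the actual binaries above take no arguments:

/* run_both.c -- an illustrative sketch of dispatching work to the two
 * separately-linked solver binaries at once. The "chunk" arguments are an
 * assumption for illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork and exec one solver binary, returning the child's pid. */
static pid_t launch(const char *binary, const char *chunk) {
    pid_t pid = fork();
    if (pid == 0) {
        execl(binary, binary, chunk, (char *)NULL);
        perror("execl");        /* only reached if exec fails */
        _exit(127);
    }
    return pid;
}

int main(void) {
    /* Give the OpenBLAS child 3 performance cores, per the timings above;
     * the variable is harmlessly ignored by the Accelerate child. */
    setenv("OPENBLAS_NUM_THREADS", "3", 1);

    pid_t a = launch("./dgesv_accelerate", "chunk0");
    pid_t b = launch("./dgesv_openblas",   "chunk1");

    int status;
    waitpid(a, &status, 0);
    waitpid(b, &status, 0);
    return 0;
}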
I am not claiming that these configurations are actually optimal, just exploring what the relative overheads are for this particular workload because I am curious. :) There may be ways to improve this, and this all could change dramatically with a new Apple Silicon processor.