How to maximize application performance for ARM big.LITTLE architecture - MPI

Question

I am writing an MPI application to speed up a math algorithm on an ARM-based device. The device has an S922X CPU, which integrates a quad-core ARM Cortex-A73 and a dual-core Cortex-A53.

Can I expect more speedup for my application by tuning the compiler options or by selecting a different compiler?

I experimented with mpic++ compiler options such as -O1, -O3, -Ofast, -ffast-math, -march=native, etc.

The final option was this: -Wall -Wextra -std=c++11 -Ofast

The built application can run on both core types. However, they have different instruction sets, so I think the binary is not yet optimized for maximum performance.

The capabilities of the two cores are described in the datasheet:

Cortex-A53 processor features

Armv8 Architecture ARM, Thumb, and ThumbEE instruction set support
Media Processing Engine (MPE) with NEON technology

Cortex-A73 processor features

Armv8-A Architecture
NEON advanced SIMD
DSP & SIMD extensions
VFPv4 floating point
Supports Hardware virtualization

How can I use the more powerful features of the A73 cores to speed up my application further? What is the best approach?

By the way, from my previous post I learned that I must use the big cores if I want maximum performance:

C/C++ MPI speedup is not as expected

According to [wikipedia](https://en.wikipedia.org/wiki/List_of_ARM_microarchitectures), both A53 and A73 have practically the same set of features. You would just need _dynamic load balancing_ to utilize cores with different performance. BTW, do you need MPI? You write about "the device" — if it is only one, why not use threading instead? — Daniel Langr, Jan 08 '21 at 11:55
According to the documentation of [ARM](https://developer.arm.com/documentation/ddi0487/fc), the two cores differ. The A53 uses the Armv8 instruction set while the A73 uses Armv8-A. For example, the A73 has the Scalable Vector Extension (SVE) instruction set. The S922X datasheet also lists different features for the cores. Anyway, I do not know whether my application uses NEON or any other special instructions. Later I would like to use more devices; that is the use case for MPI. — D_Dog, Jan 08 '21 at 12:43
The cores are 'binary compatible'. https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores is a better reference. The Cortex-A73 is a deeper pipe with more execution ports. The DMIPS are 6.35 vs 2.24 as well as bigger cache and possibly a faster clock rate. You need to use '-mtune' with the core types to tell the compiler to make the most performant code. -march=native won't work (especially if you cross compile), it will pick a lower performance CPU. — artless noise, Jan 09 '21 at 21:24

score 2 · Accepted Answer · answered Jan 08 '21 at 14:31

Your problem is twofold.

First, there are cores with varying instruction sets. Most MPI implementations provide an easy solution for that by allowing you to run jobs from more than one executable. You simply need to compile the code twice with core-specific optimisations in order to produce two executable files. Let's call them prog.big (optimised for the big cores) and prog.little (optimised for the LITTLE cores). Then, instead of launching 6 ranks from a generic executable with mpiexec -n 6 ./prog, you launch 4 ranks from prog.big and 2 ranks from prog.little:

mpiexec -n 4 ./prog.big : -n 2 ./prog.little

That's not enough though. You need to place the right process on the right core. Doing so is very implementation-specific. In the simplest case, you can tell MPI to pin/bind each MPI rank to a single logical CPU and do so in a linear fashion, i.e., rank 0 gets bound to core 0, rank 1 to core 1, etc. and hope that the OS will map the big cores to logical CPUs 0 to 3 and the LITTLE cores to logical CPUs 4 and 5. If that is not the case, you may need to perform some additional acrobatics. For example, Open MPI allows you to specify a rankfile with --rankfile filename, in which you can provide a rank to CPU mapping:

rank 0=localhost slot=0
rank 1=localhost slot=1
rank 2=localhost slot=2
...
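
Before writing such a rankfile, it helps to know which logical CPUs are actually the big ones. A minimal sketch, assuming a Linux system whose /proc/cpuinfo reports a "processor" and a "CPU part" line per logical CPU (the usual layout on 64-bit Arm kernels). The function names are hypothetical; the part IDs 0xd03 (Cortex-A53) and 0xd09 (Cortex-A73) are the Arm-assigned part numbers.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Arm "CPU part" IDs as they appear in /proc/cpuinfo on Linux.
const unsigned kPartCortexA53 = 0xd03;
const unsigned kPartCortexA73 = 0xd09;

// Strip leading and trailing blanks, tabs and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Build a map "logical CPU number -> CPU part ID" from cpuinfo-style text.
// Assumption: each CPU block starts with "processor : N" and contains a
// "CPU part : 0x..." line. Unrelated lines are ignored.
std::map<int, unsigned> parse_cpu_parts(std::istream& in) {
    std::map<int, unsigned> parts;
    int current = -1;  // logical CPU whose block is being read
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        const std::string key = trim(line.substr(0, colon));
        const std::string value = trim(line.substr(colon + 1));
        if (value.empty()) continue;
        if (key == "processor") {
            current = std::stoi(value);
        } else if (key == "CPU part" && current >= 0) {
            // Base 0 lets stoul accept the "0x" prefix.
            parts[current] = static_cast<unsigned>(std::stoul(value, nullptr, 0));
        }
    }
    return parts;
}

// Convenience overload for text that is already in memory.
std::map<int, unsigned> parse_cpu_parts(const std::string& text) {
    std::istringstream in(text);
    return parse_cpu_parts(in);
}

// List the logical CPUs that report the given part ID, in ascending order.
std::vector<int> cpus_with_part(const std::map<int, unsigned>& parts,
                                unsigned part) {
    std::vector<int> cpus;
    for (const auto& entry : parts) {
        if (entry.second == part) cpus.push_back(entry.first);
    }
    return cpus;
}
```

Opening /proc/cpuinfo with std::ifstream and passing it to parse_cpu_parts, then calling cpus_with_part(parts, kPartCortexA73), should give the slot numbers to assign to the prog.big ranks in the rankfile; the remaining CPUs go to the prog.little ranks.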

Having optimised executable files and properly placed processes is only half of the solution. The rest is to actually have a parallel algorithm that can make use of CPUs with different speeds. If you have a globally synchronous algorithm, for example one solving PDEs, or anything iterative in general, then the computation time of a single step is that of the slowest MPI rank. If you give the same amount of work to the big and to the LITTLE cores, the latter will lag significantly and the former will have to wait, wasting computational time. So you need to either perform some advanced domain decomposition and give smaller work items to the slower cores or use an approach such as "bag of work" (a.k.a. controller/worker) and have each worker rank request a piece of data to work on. In this case, faster cores will process more items and the work will balance itself automatically.
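
The "smaller work items for the slower cores" option can be sketched as a static split in proportion to each rank's relative speed. This is only an illustration under assumptions: the function name weighted_counts is hypothetical, and the weights are values you would have to measure for your own kernel on each core type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split n work items among ranks in proportion to their relative speeds.
// weights[i] is the assumed relative throughput of rank i (hypothetical
// input; it must be positive). The returned counts always sum to n.
std::vector<std::size_t> weighted_counts(std::size_t n,
                                         const std::vector<double>& weights) {
    assert(!weights.empty());
    double total = 0.0;
    for (double w : weights) {
        assert(w > 0.0);
        total += w;
    }

    std::vector<std::size_t> counts(weights.size());
    double cumulative = 0.0;
    std::size_t previous = 0;  // end of the previous rank's slice
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cumulative += weights[i];
        // End of rank i's slice. The last boundary is pinned to n so that
        // rounding can never drop or duplicate an item.
        std::size_t boundary = n;
        if (i + 1 < weights.size()) {
            boundary = static_cast<std::size_t>(
                static_cast<double>(n) * cumulative / total);
        }
        if (boundary > n) boundary = n;
        if (boundary < previous) boundary = previous;
        counts[i] = boundary - previous;
        previous = boundary;
    }
    return counts;
}
```

With four big ranks and two LITTLE ranks, one possible starting point is the DMIPS figures quoted in the comments (6.35 and 2.24) as weights, which gives each big rank a bit under three times as many items as each LITTLE rank. Those figures are only a rough proxy; a static split like this works when the per-item cost is predictable, while the controller/worker approach copes better when it is not.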
