I am writing an MPI application to speed up a math algorithm on an ARM-based device. The device has an S922X SoC, which integrates a quad-core ARM Cortex-A73 CPU and a dual-core Cortex-A53 CPU.
I am wondering: by tuning the compiler options, or by switching to a different compiler, can I expect more speedup for my application?
I have been experimenting with options of the mpic++ compiler such as -O1, -O3, -Ofast, -ffast-math, -march=native, etc.
The set of options I ended up with is: -Wall -Wextra -std=c++11 -Ofast
The built application runs on both core types. However, judging by the datasheet they seem to have different instruction set features, so I think the binary is not yet fully optimized for performance.
The capabilities of the two core types are described in the datasheet:
Cortex-A53 processor features
- Armv8 Architecture ARM, Thumb, and ThumbEE instruction set support
- Media Processing Engine (MPE) with NEON technology
Cortex-A73 processor features
- Armv8-A Architecture
- NEON advanced SIMD
- DSP & SIMD extensions
- VFPv4 floating point
- Supports Hardware virtualization
How can I use the more powerful features of the A73 cores to speed up my application further? What is the best approach?
By the way, from my previous post I learned that I must use the big cores if I want maximum performance:
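To make "use the big cores" concrete, here is a minimal sketch of pinning the calling process to a chosen set of CPUs on Linux with sched_setaffinity. The helper name pin_to_cpus and the constant kAssumedBigCores are made up for illustration, and the numbering of the A73 cores as CPUs 2-5 is an assumption that should be checked on the actual board (for example in /proc/cpuinfo) before relying on it.

```cpp
// Minimal sketch, Linux/glibc only: restrict the calling thread to a set of CPUs.
// g++ normally defines _GNU_SOURCE for C++, which <sched.h> needs for the CPU_* macros.
#include <sched.h>
#include <vector>

// ASSUMPTION: CPUs 2-5 are the Cortex-A73 cores on this board.
// This numbering is not taken from the datasheet; verify it on the device.
const std::vector<int> kAssumedBigCores = {2, 3, 4, 5};

// Hypothetical helper: returns true if the affinity mask was applied,
// false for an empty or out-of-range list or if the kernel rejects the mask.
bool pin_to_cpus(const std::vector<int>& cpus) {
    if (cpus.empty()) {
        return false;
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu : cpus) {
        if (cpu < 0 || cpu >= CPU_SETSIZE) {
            return false;
        }
        CPU_SET(cpu, &mask);
    }
    // A pid of 0 means "the calling thread"; threads created later inherit the mask.
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

In an MPI program this could be called once per rank early in main, for example pin_to_cpus(kAssumedBigCores) right after MPI_Init. Many MPI launchers also offer their own binding options, which may be a simpler route than doing it in code.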