neon float multiplication is slower than expected

Question

I have two tabs of floats. I need to multiply elements from the first tab by corresponding elements from the second tab and store the result in a third tab.

I would like to use NEON to parallelize the float multiplications: four multiplications at a time instead of one.

I expected a significant speed-up, but I achieved only about a 20% reduction in execution time. This is my code:

#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>

const int n = 100; // table size

/* fill a tab with random floats */
void rand_tab(float *t) {
    for (int i = 0; i < n; i++)
        t[i] = (float)rand()/(float)RAND_MAX;
}

/* Multiply elements of two tabs and store results in third tab
 - STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i++)
         tr[i] = t1[i] * t2[i]; 
}

/* Multiply elements of two tabs and store results in third tab 
- NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i+=4)
        vst1q_f32(tr+i, vmulq_f32(vld1q_f32(t1+i), vld1q_f32(t2+i)));
}

int main() {
    float t1[n], t2[n], tr[n];

    /* fill tables with random values */
    srand(1); rand_tab(t1); rand_tab(t2);


    // I repeat the table multiplication 1000000 times for measuring purposes:
    for (int k=0; k < 1000000; k++)
        mul_tab_standard(t1, t2, tr);  // switch to next line for comparison:
        //mul_tab_neon(t1, t2, tr);
    return 0;
}

I run the following command to compile: g++ -mfpu=neon -ffast-math neon_test.cpp

My CPU: ARMv7 Processor rev 0 (v7l)

Do you have any ideas how I can achieve a more significant speed-up?

I searched Google for your functions `vst1q_f32` and `vmulq_f32` but cannot find much info about them. Can you provide a link to the docs? — Tony The Lion, Sep 14 '12 at 07:51
These functions are listed here: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html I have not found a detailed description of them. They produce correct arithmetic results. — tomto, Sep 14 '12 at 08:06
You need to add `-O3` to your `g++` command line. BTW, I don't recommend `-ffast-math`. So: `g++ -Wall -O3 -mfpu=neon neon_test.cpp`. — Paul R, Sep 14 '12 at 08:36

score 5 · Answer 1 · answered Sep 14 '12 at 08:49

Cortex-A8 and Cortex-A9 can do only two SP FP multiplications per cycle, so you can at most double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll the loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as the one for x86.

I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, set -mcpu and -mtune to cortex-a15, cortex-a8 or cortex-a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and also unroll even more than you usually do), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc that supports them).
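To illustrate the unrolling advice, here is a minimal sketch that processes 16 floats per iteration instead of 4. The function name `mul_tab_neon_unrolled` and the `count` parameter are made up for this example and are not part of the question's code; the scalar fallback is only there so the snippet also builds on non-NEON targets. Whether it actually runs faster depends on the CPU and compiler, so measure it.

```cpp
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the question): multiplies `count` floats
// element-wise, handling 16 floats per loop iteration on NEON builds.
void mul_tab_neon_unrolled(const float *t1, const float *t2, float *tr,
                           std::size_t count) {
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 16 <= count; i += 16) {
        // Load everything first so the four multiplies are independent.
        float32x4_t a0 = vld1q_f32(t1 + i);
        float32x4_t a1 = vld1q_f32(t1 + i + 4);
        float32x4_t a2 = vld1q_f32(t1 + i + 8);
        float32x4_t a3 = vld1q_f32(t1 + i + 12);
        float32x4_t b0 = vld1q_f32(t2 + i);
        float32x4_t b1 = vld1q_f32(t2 + i + 4);
        float32x4_t b2 = vld1q_f32(t2 + i + 8);
        float32x4_t b3 = vld1q_f32(t2 + i + 12);
        vst1q_f32(tr + i,      vmulq_f32(a0, b0));
        vst1q_f32(tr + i + 4,  vmulq_f32(a1, b1));
        vst1q_f32(tr + i + 8,  vmulq_f32(a2, b2));
        vst1q_f32(tr + i + 12, vmulq_f32(a3, b3));
    }
#endif
    // Scalar tail: leftover elements, or all of them on non-NEON builds.
    for (; i < count; ++i)
        tr[i] = t1[i] * t2[i];
}
```

With the question's arrays this would be called as `mul_tab_neon_unrolled(t1, t2, tr, n)`; since n = 100 is not a multiple of 16, the last 4 elements go through the scalar tail.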

"Cortex-A8 and Cortex-A9 can do only two SP FP multiplications per cycle" - I didn't know that. This is the main reason for my unsatisfactory speed-up. Thanks! I have experimented with the optimization options, but it hasn't helped. — tomto, Sep 14 '12 at 09:21

score 2 · Answer 2 · answered Sep 30 '12 at 20:39

One shortcoming of NEON intrinsics is that you can't use auto-increment on loads, which shows up as extra instructions in your NEON implementation.

Compiled with gcc version 4.4.3 and options -c -std=c99 -mfpu=neon -O3 and dumped with objdump, this is the loop part of mul_tab_neon:

000000a4 <mul_tab_neon>:
  ac:   e0805003    add r5, r0, r3
  b0:   e0814003    add r4, r1, r3
  b4:   e082c003    add ip, r2, r3
  b8:   e2833010    add r3, r3, #16
  bc:   f4650a8f    vld1.32 {d16-d17}, [r5]
  c0:   f4642a8f    vld1.32 {d18-d19}, [r4]
  c4:   e3530e19    cmp r3, #400    ; 0x190
  c8:   f3400df2    vmul.f32    q8, q8, q9
  cc:   f44c0a8f    vst1.32 {d16-d17}, [ip]
  d0:   1afffff5    bne ac <mul_tab_neon+0x8>

and this is the loop part of mul_tab_standard:

00000000 <mul_tab_standard>:
  58:   ecf01b02    vldmia  r0!, {d17}
  5c:   ecf10b02    vldmia  r1!, {d16}
  60:   f3410db0    vmul.f32    d16, d17, d16
  64:   ece20b02    vstmia  r2!, {d16}
  68:   e1520003    cmp r2, r3
  6c:   1afffff9    bne 58 <mul_tab_standard+0x58>

As you can see, in the standard case the compiler creates a much tighter loop.
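One thing that may be worth trying is to advance the pointers yourself instead of indexing with `i`, so the compiler has a chance to fold the increments into the loads and stores. This is only a sketch: the name `mul_tab_neon_ptr` and the `count` parameter are invented for the example, and whether a given gcc version really emits post-increment addressing for it is an assumption you would have to check with objdump.

```cpp
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the question): same multiply, but written
// with pointer bumping rather than base + index addressing.
void mul_tab_neon_ptr(const float *t1, const float *t2, float *tr,
                      std::size_t count) {
    std::size_t remaining = count;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    while (remaining >= 4) {
        vst1q_f32(tr, vmulq_f32(vld1q_f32(t1), vld1q_f32(t2)));
        t1 += 4;  // each pointer advances by one 4-float vector
        t2 += 4;
        tr += 4;
        remaining -= 4;
    }
#endif
    // Scalar tail: leftover elements, or all of them on non-NEON builds.
    while (remaining > 0) {
        *tr++ = *t1++ * *t2++;
        --remaining;
    }
}
```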
