How can I optimize a looped 4D matrix-vector-multiplication with ARM NEON?

Question

I'm working on optimizing a 4D (128 Bit) matrix-vector multiplication using ARM NEON Assembler.

If I load the matrix, and the vector into the NEON Registers and transform it, I won't get a great performance boost, because the switch to the NEON Registers cost 20 cycles. Furthermore I reload the matrix for each multiplication, despite it has not changed.

There is enough register-space to perform the transformation on more vectors a time. This IS increasing performance.

But..

I'm wondering how fast this operation would be if I do the loop over all vertices (increasing pointers) within the assembler. But I am at the very beginning of Neon assembler and though don't know how to do this. Can someone give me an hand on that?

What I want to achieve:

load matrix and first vector
store loop count "count" and..
-- LOOP_START --
perform multiply-adds (do the Transformation)
write q0 to vOut
increase pointers vIn and vOut by 4 (128 Bit)
LOAD vIn to q5.
-- LOOP_END --

existing C-Version of loop:

void TransformVertices(ESMatrix* m, GLfloat* vertices, GLfloat* normals, int count)
{
    GLfloat* pVertex = vertices;
    int i;  

    // iterate trough vertices only one at a time
    for (i = 0; i < count ; i ++)
    {
        Matrix4Vector4Mul( (float *)m, (float *)pVertex, (float *)pVertex);
        pVertex += 4;
    }

    //LoadMatrix( (const float*) m);

    //// two at a time
    //for (i = 0; i < count ; i += 2)
    //{
    //    Matrix4Vector4Mul2( (float *)m, (float *)pVertex, (float *)(pVertex + 4));
    //      pVertex += 8;
    //}
}

Following code for NEON-Version on doing only one transformation:

void Matrix4Vector4Mul (const float* m, const float* vIn, float* vOut)
{    
    asm volatile
    (

    "vldmia %1, {q1-q4 }     \n\t"
    "vldmia %2, {q5}         \n\t"

    "vmul.f32 q0, q1, d10[0] \n\t"        
    "vmla.f32 q0, q2, d10[1] \n\t"      
    "vmla.f32 q0, q3, d11[0] \n\t"        
    "vmla.f32 q0, q4, d11[1] \n\t"

    "vstmia %0, {q0}"

    : // no output
    : "r" (vOut), "r" (m), "r" (vIn)       
    : "memory", "q0", "q1", "q2", "q3", "q4", "q5" 
    );

}

C-Version of transformation:

void Matrix4Vector4Mul (const float* m, const float* vIn, float* vOut)
{
    Vertex4D* v1 =    (Vertex4D*)vIn;
    Vertex4D vOut1;
    Vertex4D* l0;
    Vertex4D* l1;
    Vertex4D* l2;
    Vertex4D* l3;

    // 4x4 Matrix with members m00 - m33 
    ESMatrix* m1 = (ESMatrix*)m;

    l0 = (Vertex4D*)&m1->m00;
    vOut1.x = l0->x * v1->x;
    vOut1.y = l0->y * v1->x;
    vOut1.z = l0->z * v1->x;
    vOut1.w = l0->w * v1->x;

    l1 = (Vertex4D*)&m1->m10;
    vOut1.x += l1->x * v1->y;
    vOut1.y += l1->y * v1->y;
    vOut1.z += l1->z * v1->y;
    vOut1.w += l1->w * v1->y;

    l2 = (Vertex4D*)&m1->m20;
    vOut1.x += l2->x * v1->z;
    vOut1.y += l2->y * v1->z;
    vOut1.z += l2->z * v1->z;
    vOut1.w += l2->w * v1->z;

    l3 = (Vertex4D*)&m1->m30;
    vOut1.x += l3->x * v1->w;
    vOut1.y += l3->y * v1->w;
    vOut1.z += l3->z * v1->w;
    vOut1.w += l3->w * v1->w;

    *(vOut) = vOut1.x;
    *(vOut + 1) = vOut1.y;
    *(vOut + 2) = vOut1.z;
    *(vOut + 3) = vOut1.w;
}

Performance: (Transform > 90 000 Vertices | Android 4.0.4 SGS II)

C-Version:    190 FPS 
NEON-Version: 162 FPS ( .. slower -.- )

--- LOAD Matrix only ONCE (seperate ASM) and then perform two V's at a time ---

NEON-Version: 217 FPS ( + 33 % NEON | + 14 % C-Code )

Provide Matrix4Vector4Mul as well, in fact just make them one loop, as you would write it in plain c. — auselen, Oct 19 '12 at 14:59
... i provided all code for the operation now. It is a loop with just the transforming-function. Hope i got you right? — oc1d, Oct 19 '12 at 15:18

auselen · Answer 1 · 2012-10-21T07:02:27.863

Did you try playing with compiler flags?

-mcpu=cortex-a9 -mtune=cortex-a9 -mfloat-abi=softfp -mfpu=neon -O3

does pretty job for me in this case (gcc 4.4.3, distributed with Android NDK 8b). Try to have tight source code by defining internal functions static and inline as well as moving matrix (m[X][0] stuff) to static global variables or just merge Matrix4Vector4Mul into loop and make matrix local variables instead of keep passing it in function - gcc doesn't get smart there.

When I do this, I get below for the main loop.

  a4:   ed567a03    vldr    s15, [r6, #-12]
  a8:   ee276aa0    vmul.f32    s12, s15, s1
  ac:   ee676aa8    vmul.f32    s13, s15, s17
  b0:   ed564a04    vldr    s9, [r6, #-16]
  b4:   ee277a88    vmul.f32    s14, s15, s16
  b8:   ed165a02    vldr    s10, [r6, #-8]
  bc:   ee677a80    vmul.f32    s15, s15, s0
  c0:   ed565a01    vldr    s11, [r6, #-4]
  c4:   e2833001    add r3, r3, #1
  c8:   ee046a89    vmla.f32    s12, s9, s18
  cc:   e1530004    cmp r3, r4
  d0:   ee446aaa    vmla.f32    s13, s9, s21
  d4:   ee047a8a    vmla.f32    s14, s9, s20
  d8:   ee447aa9    vmla.f32    s15, s9, s19
  dc:   ee056a22    vmla.f32    s12, s10, s5
  e0:   ee456a01    vmla.f32    s13, s10, s2
  e4:   ee057a21    vmla.f32    s14, s10, s3
  e8:   ee457a02    vmla.f32    s15, s10, s4
  ec:   ee056a8b    vmla.f32    s12, s11, s22
  f0:   ee456a83    vmla.f32    s13, s11, s6
  f4:   ee057aa3    vmla.f32    s14, s11, s7
  f8:   ee457a84    vmla.f32    s15, s11, s8
  fc:   ed066a01    vstr    s12, [r6, #-4]
 100:   ed466a04    vstr    s13, [r6, #-16]
 104:   ed067a03    vstr    s14, [r6, #-12]
 108:   ed467a02    vstr    s15, [r6, #-8]
 10c:   e2866010    add r6, r6, #16
 110:   1affffe3    bne a4 <TransformVertices+0xa4>

Having 4 loads, 4 multiplies, 12 multiply and accumulates and 4 stores which matches with what you are doing in Matrix4Vector4Mul.

If you are still not satisfied with compiler generated code, pass compiler '-S' to get assembly output and use that as a starting point to improve further instead of starting from scratch.

You should also check that vertices is cache line size aligned (32 bytes for Cortex-A9) to get a nice data flow.

For vectorization there are gcc options like -ftree-vectorizer-verbose=9 to print information what was vectorized. Also search in gcc documentation this one to see how you can direct gcc or what you need to modify to get your multiplications vectorized. This might sound a lot to dig in but it would be more fruitful for you in long run than 'hand vectorizing'.

Thanks! :) I will try out the compilerflags and post result in this thread. But the compileroutput seems to work on single-precision floating-points, while it could work also on quads with NEON. Cortex A9 executes 64 Bit at a time (and that will be 8 cycles), but with A15 there is 128 bit execution with single cycle. That will be **4** vs. **16** instructions for the transformation. One says that handwritten ASM isn't worth the time, since A9 and A15 are much more advanced than A8, but i couldnt find a proof for that. — oc1d, Oct 21 '12 at 10:50
That's objdump, use arm-linux-androideabi-objdump -S . gcc -S creates a code that can be assembled later, objdump output can't. That's why I mentioned in that way above. — auselen, Oct 22 '12 at 09:56

Aki Suihkonen · Answer 2 · 2012-10-22T05:13:23.130

The hand tuned neon version suffers from dependency between all of the operations, while gcc is able to do out-of-order scheduling for the c-version. You should be able to improve the NEON version by calculating in parallel two or more independent threads:

Pointer increment (post increment) in NEON is done with exclamation mark. Those registers then should be included in the output register list "=r" (vOut)

vld1.32 {d0,d1}, [%2]!   ; // next round %2=%2 + 16 
vst1.32 {d0},    [%3]!   ; // next round %3=%3 + 8

Another addressing mode allows post increment by a "stride" defined in another arm register. The option is available only on some load commands (as there are a variety of interleaving options as well as loading to chosen elements of say d1[1] (upper part)).

vld1.16 d0, [%2], %3    ; // increment by register %3

The counter increment happens with sequence

1: subs %3, %3, #1      ; // with "=r" (count) as fourth argument
bne 1b                  ; // create a local label

Local label is used, as two "bne loop" statements in the same file causes an error

One should be able to increase parallelism by a factor of four by calculating fused multiply adds for vectors instead of single elements.

In this case it's worthwhile to perform a matrix transpose in advance (either before calling the routine or with special addressing mode).

asm(
   "vld1.32 {d0[0],d2[0],d4[0],d6[0]}, [%0]! \n\t"
   "vld1.32 {d0[1],d2[1],d4[1],d6[1]}, [%0]! \n\t"
   "vld1.32 {d1[0],d3[0],d5[0],d7[0]}, [%0]! \n\t"
   "vld1.32 {d1[1],d3[1],d5[1],d7[1]}, [%0]! \n\t"

   "vld1.32 {q8}, [%2:128]! \n\t"
   "vld1.32 {q9}, [%2:128]! \n\t"
   "vld1.32 {q10}, [%2:128]! \n\t"
   "vld1.32 {q11}, [%2:128]! \n\t"

   "subs %0, %0, %0 \n\t"   // set zero flag

   "1: \n\t"
   "vst1.32 {q4}, [%1:128]! \n\t"
   "vmul.f32 q4, q8, q0 \n\t"
   "vst1.32 {q5}, [%1:128]! \n\t"
   "vmul.f32 q5, q9, q0 \n\t"
   "vst1.32 {q6}, [%1:128]! \n\t"
   "vmul.f32 q6, q10, q0 \n\t"
   "vst1.32 {q7}, [%1:128]!  \n\t"
   "vmul.f32 q7, q11, q0 \n\t"

   "subne %1,%1, #64    \n\t"    // revert writing pointer in 1st iteration 

   "vmla.f32 q4, q8, q1 \n\t"
   "vmla.f32 q5, q9, q1 \n\t"
   "vmla.f32 q6, q10, q1 \n\t"
   "vmla.f32 q7, q11, q1 \n\t"
   "subs %2, %2, #1 \n\t"
   "vmla.f32 q4, q8, q2 \n\t"
   "vmla.f32 q5, q9, q2 \n\t"
   "vmla.f32 q6, q10, q2 \n\t"
   "vmla.f32 q7, q11, q2 \n\t"

   "vmla.f32 q4, q8, q3 \n\t"
   "vld1.32 {q8}, [%2:128]! \n\t"  // start loading vectors immediately
   "vmla.f32 q5, q9, q3 \n\t"
   "vld1.32 {q9}, [%2:128]! \n\t"  // when all arithmetic is done
   "vmla.f32 q6, q10, q3 \n\t"
   "vld1.32 {q10}, [%2:128]! \n\t"
   "vmla.f32 q7, q11, q3 \n\t"
   "vld1.32 {q11}, [%2:128]! \n\t"
   "jnz b1 \n\t"
   "vst1.32 {q4,q5}, [%1:128]! \n\t"  // write after first loop
   "vst1.32 {q6,q7}, [%1:128]! \n\t"
 : "=r" (m), "=r" (vOut), "=r" (vIn), "=r" ( N ), 
 :
 : "d0","d1","q0", ... ); // marking q0 isn't enough for some gcc version

Read and write to 128 bit aligned blocks (make sure the data ptr is aligned too)
there's a malloc with align, or just adjust manually ptr=((int)ptr + 15) & ~15.

Just as there is a post loop block writing the results, one can write a similar pre loop block that skips the first write of nonsense to vOut (that could also be overcome by conditional write). One can unfortunately only write 64 bit registers conditionally.

nice! :) Fused Multiply-Adds are only supported in VFPv4.. i think cortex A15 (and optional). — oc1d, Oct 22 '12 at 08:50
Sorry, couldn't remember that. I had a vague recollection of writing something like that in A8. But there are anyway 4 registers left in the register bank (q12-q15) to multiply the intermediate result before adding to q4-q7. Also there's a possibility of going for fixed point implementation and use the multiply accumulate that is available. — Aki Suihkonen, Oct 22 '12 at 09:32
instead of `"subs %2, %2, #1 \n\t"` its `"subs %3, %3, #1 \n\t"`, right? — oc1d, Oct 24 '12 at 10:04

score 0 · Answer 3 · answered Aug 29 '13 at 07:56

It's an almost one full year old topic by now, but I think it's important to give you the "correct" answer since something is very fishy here, and noone has pointed this out so far :

You should avoid using q4-q7 if possible since they have to be preserved prior to use
Correct me if I'm wrong on this, but if my memory isn't failing me, only d0~d3 (or d0~d7) can hold scalars. I'm really wondering why gcc is tolerating d10 and d11 as scalar operands. Since it's physically impossible that way, I guess gcc is again doing something crazy with your inline assembly. Check out the disassembly of your inline assembly code.

True, your inline assembly code suffers from two interlocks (2cycles after load and 9 cycles before store), but it's unimaginable for me that the NEON code runs slower than the C code.

It's a really strong guess from my side that gcc does some heavy register transferring back and forth instead spitting out an error message. And it isn't exactly doing a favor in this case.

How can I optimize a looped 4D matrix-vector-multiplication with ARM NEON?

3 Answers3