15

How do I use the Multiply-Accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);

Can anyone explain what three parameters I have to pass to this function? I mean, which are the source and destination registers, and what does the function return?

Help!!!

HaggarTheHorrible
  • The GCC docs (and the RealView docs for the intrinsics that the GCC intrinsics appear to be based on) are pretty sparse... if you don't get a decent answer, I'd suggest just compiling a few calls and taking a look at the assembly that's output. That should give you a pretty good idea (even if it's a less than ideal way to go). – Michael Burr Jul 13 '10 at 19:10

3 Answers

22

Simply said the vmla instruction does the following:

typedef struct
{
  float val[4];
} float32x4_t;


float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
{
  float32x4_t result;

  for (int i=0; i<4; i++)
  {
    result.val[i] =  b.val[i]*c.val[i]+a.val[i];
  }

  return result;
}

And all this compiles into a single assembler instruction :-)

You can use this NEON intrinsic, among other things, for the typical 4x4 matrix multiplications of 3D graphics, like this:

float32x4_t transform (float32x4_t * matrix, float32x4_t vector)
{
  /* in a perfect world this code would compile into just four instructions */
  float32x4_t result;

  result = vmul (matrix[0], vector);
  result = vmla (result, matrix[1], vector);
  result = vmla (result, matrix[2], vector);
  result = vmla (result, matrix[3], vector);

  return result;
}

This saves a couple of cycles because you don't have to add the results after the multiplication. The addition is needed so often that multiply-accumulate has become mainstream these days (even x86 has added it in some recent SSE instruction set).

Also worth mentioning: Multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast path inside the Cortex-A8 NEON core. This fast path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction. I could go into detail, but in a nutshell such an instruction series runs four times faster than a VMUL / VADD / VMUL / VADD series.

Take a look at my simple matrix multiply: it does exactly that. Due to this fast path it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.

Nils Pipenbrinck
  • Thank you for such a detailed reply. Your reply not only explains the instruction's functionality but also the pros and cons of using this instruction. – HaggarTheHorrible Jul 14 '10 at 05:30
  • Hi Nils, I understood how the matrix multiplication can be sped up using the NEON instructions. It's really addictive now :) I want to use the NEON instructions to compute the inverse of a matrix; can you point me to some good documents that explain how to use NEON instructions to invert a matrix, or give me any ideas on how to go about that? Thank you. – HaggarTheHorrible Jul 14 '10 at 08:31
  • 1
    for matrix inverse I'd do a google search on "sse matrix inverse" and port the sse code to NEON. The usual way is to calculate the inverse for small matrices (4x4) is via Cramers rule. – Nils Pipenbrinck Jul 14 '10 at 08:33
  • Nils, can you please take a look at this related question of mine? Also, can you please compile the example code I have posted there and tell me if the compiler is able to generate NEON SIMD instructions for the matrix multiplication? Thank you. [http://stackoverflow.com/questions/3307821/how-to-verify-vectorization-for-eigen-the-c-template-library-for-linear-algebr] – HaggarTheHorrible Jul 22 '10 at 11:35
  • 1
    Great answer. Just wanted to add a note for vikramtheone and others to make sure you really need the matrix inverse. Often the pseudoinverse will do, and finding that is a faster and more stable computation. – Robert Calhoun Jun 21 '11 at 18:49
  • ... even more often, a clever transpose or simple algorithm will do, e.g. when there are no scale components and it's a standard transformation matrix. http://jheriko-rtw.blogspot.co.uk/2011/01/fast-matrix-inversion-revisited.html – jheriko Jul 24 '12 at 09:03
  • I think your C implementation of vmla is wrong. There is no separate result register. The vmla instruction is specified as "VMLA.F32 q0,q0,q0" – auselen Oct 17 '12 at 22:30
9

Googling for vmlaq_f32 turned up the reference for the RVCT compiler tools. Here's what it says:

Vector multiply accumulate: vmla -> Vr[i] := Va[i] + Vb[i] * Vc[i]
...
float32x4_t vmlaq_f32 (float32x4_t a, float32x4_t b, float32x4_t c);

AND

The following types are defined to represent vectors. NEON vector data types are named according to the following pattern: <type><size>x<number of lanes>_t. For example, int16x4_t is a vector containing four lanes, each containing a signed 16-bit integer. Table E.1 lists the vector data types.

IOW, the return value from the function will be a vector containing 4 32-bit floats, and each element of the vector is calculated by multiplying the corresponding elements of b and c, and adding the contents of a.

HTH

Aidan Cully
1
result = vmul (matrix[0], vector);
result = vmla (result, matrix[1], vector);
result = vmla (result, matrix[2], vector);
result = vmla (result, matrix[3], vector);

This sequence won't work, though. The problem is that the x component of the result accumulates only the vector's x component, scaled by the x lanes of the matrix rows, and can be expressed as:

result.x = vector.x * (matrix[0][0] + matrix[1][0] + matrix[2][0] + matrix[3][0]);

...

The correct sequence would be:

result = vml (matrix[0], vector.xxxx);
result = vmla(result, matrix[1], vector.yyyy);

...

NEON and SSE don't have built-in lane selection for the operands (this would require 8 bits of instruction encoding per vector register). GLSL/HLSL, for example, does have this kind of facility, and so do most GPUs.

Alternative way to achieve this would be:

result.x = dp4(vector, matrix[0]);
result.y = dp4(vector, matrix[1]);

... // and of course, the matrix would have to be transposed for this to yield the same result

The mul, madd, madd, madd sequence is usually preferred, as it does not require a write mask for the target register fields.

Otherwise the code looks good. =)

gpudude