NEON Fixed point coding and Fixed vs Floating point operations performance comparison

Question

As we can see from "ARM integer NEON operations cycles" and "ARM float NEON operations cycles", integer multiply operations do not seem to have a definite advantage over floating-point multiply operations. When I converted my floating-point code to fixed point, I had to add an extra shift instruction after the fixed-point multiplication/division instructions. The cycle count of the program actually increased because of the additional instructions, so performance deteriorated with fixed point (14000 cycles for the floating-point code, 26000 cycles for the fixed-point code).

Are there any NEON instructions dedicated to fixed-point operations (multiplication and division)? I only found one instruction, which just converts between fixed and floating point. Is there an efficient way of writing fixed-point programs in NEON?

I wrote the following sample floating-point code.

    VMUL   Q14.F32,Q8.F32,Q2.F32
    VMUL   Q15.F32,Q8.F32,Q3.F32
    VLD2    {Q10.F32,Q11.F32},[pTw2@256],TwdStep
    VLD2    {Q4.F32,Q5.F32},[pT1@256],fftSize
    VMLA   Q14.F32,Q9.F32,Q3.F32
    VMLS   Q15.F32,Q9.F32,Q2.F32

The following is the fixed-point version, obtained by inserting shift operations after the multiply (VMUL/VMLA/VMLS) instructions.

    VMUL   Q14.S32,Q8.S32,Q2.S32
    VMUL   Q15.S32,Q8.S32,Q3.S32
    VLD2    {Q10.S32,Q11.S32},[pTw2@256],TwdStep
    VLD2    {Q4.S32,Q5.S32},[pT1@256],fftSize
    VMLA   Q14.S32,Q9.S32,Q3.S32
    VMLS   Q15.S32,Q9.S32,Q2.S32

    VRSHR    Q14.S32,Q14.S32,#12     ;Shift instructions to account for fixed point
    VRSHR    Q15.S32,Q15.S32,#12     ;
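
For reference, here is a minimal scalar sketch in C of what one lane of the fixed-point listing above computes. It assumes the data is in a format with 12 fractional bits (inferred from the `#12` shift), and the names `mul_lo32`, `rshr_round` and `twiddle_mul_q12` are hypothetical helpers, not part of any NEON API. It also shows a likely pitfall: a 32-bit `VMUL` keeps only the low 32 bits of each product, so Q12 operands overflow quickly.

```c
#include <stdint.h>

#define FRAC_BITS 12  /* assumption: taken from the #12 shift in the listing */

/* Low 32 bits of a 32x32 product, which is what VMUL.S32 keeps per lane.
   Done in unsigned arithmetic so the wrap-around is well defined in C. */
static int32_t mul_lo32(int32_t x, int32_t y)
{
    return (int32_t)((uint32_t)x * (uint32_t)y);
}

/* Rounding shift right, modelling VRSHR.S32 per lane: add half an LSB of
   the result, then shift. Relies on arithmetic right shift of negative
   values, which mainstream compilers provide. */
static int32_t rshr_round(int32_t x, int n)
{
    return (int32_t)(((int64_t)x + ((int64_t)1 << (n - 1))) >> n);
}

typedef struct { int32_t re, im; } cq12_t;  /* hypothetical result pair */

/* One lane of the VMUL/VMLA/VMLS/VRSHR sequence, with
   a = Q8, b = Q9, c = Q2, d = Q3 from the listing:
     re = (a*c + b*d) >> 12   (lands in Q14)
     im = (a*d - b*c) >> 12   (lands in Q15) */
static cq12_t twiddle_mul_q12(int32_t a, int32_t b, int32_t c, int32_t d)
{
    uint32_t re = (uint32_t)mul_lo32(a, c) + (uint32_t)mul_lo32(b, d);
    uint32_t im = (uint32_t)mul_lo32(a, d) - (uint32_t)mul_lo32(b, c);
    cq12_t out = { rshr_round((int32_t)re, FRAC_BITS),
                   rshr_round((int32_t)im, FRAC_BITS) };
    return out;
}
```

With a = 1.5, b = 0.5, c = 0.5, d = 0.25 in Q12 (6144, 2048, 2048, 1024) this gives 3584 and 512, i.e. 0.875 and 0.125. On the other hand `mul_lo32(1 << 20, 1 << 12)` (256.0 times 1.0) wraps to 0, so the operand range has to be kept small for this 32-bit scheme to be valid.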

These days fixed point usually only makes sense on CPUs which are floating-point-challenged, such as low end micro-controllers, and on CPUs with explicit fixed point support (various DSP families, some SIMD architectures). Otherwise just use floating point. — Paul R, Apr 04 '13 at 16:23
You can gain an advantage by combining pipelines if possible. Are you using 32-bit values, or is there SIMD going on? Regular ARM has `MLA`, `MUL`, etc., which operate on 32-bit values. You can do one floating-point calculation in the NEON core and another fixed-point one with the ARM. — artless noise, Apr 04 '13 at 18:20
@artlessnoise I just wanted to see the capability of NEON. Doing things in parallel really helps!! — Wolfrum, Apr 05 '13 at 04:32
Sorry, I am not familiar with NEON. The registers are 64-bit, so you are doing two operations at once. My point is that you can do some calculations with the ARM integer unit while the NEON unit is also running code; it doesn't look like this will work for your **FFT**. — artless noise, Apr 05 '13 at 14:18

score 2 · Answer 1 · answered Apr 05 '13 at 07:39

See the Vector Floating Point Instruction Set Quick Reference Card for the set of NEON instructions. There are no dedicated fixed-point instructions.

I suggest you read the blog.arm.com post titled "Coding for NEON - Part 3: Matrix Multiplication / Fixed Point"; it may give you some ideas to try.

It claims:

Using fixed point arithmetic for calculations is often faster than floating point – it requires less memory bandwidth to read and write values that use fewer bits, and multiplication of integer values is generally quicker than the same operations applied to floating point numbers.

However, when using fixed point arithmetic, you must choose the representation carefully to avoid overflow or saturation, whilst preserving the degree of precision in the results that your application requires.
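
To make the overflow/saturation point concrete, here is a small scalar sketch in C, assuming a 16-bit Q15 format (15 fractional bits, range [-1, 1)). The function name `q15_mul_sat` is made up for illustration and is not taken from the blog post.

```c
#include <stdint.h>

/* Multiply two Q15 values with rounding and saturation.
   The Q30 product always fits in 32 bits, so no precision is lost
   before the shift back to Q15. */
static int16_t q15_mul_sat(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q15 * Q15 = Q30 */
    p = (p + (1 << 14)) >> 15;             /* round and return to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;      /* only -1.0 * -1.0 gets here */
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}
```

Here 0.5 * 0.5 (16384 * 16384) yields 8192, i.e. 0.25, while -1.0 * -1.0 cannot be represented in Q15 and is clamped to 32767. Each value also takes half the storage of a 32-bit float, which is where the memory-bandwidth argument comes from.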

In the example pointed to above, "use fewer bits" is the important part: the example uses 32 bits for floating point but 16 bits for fixed point. In my case I used 32 bits for both floating and fixed point. — Wolfrum, Apr 05 '13 at 08:44
