I am trying to calculate the following using NEON in assembly: ((200*(53-255))/255) + 255, whose result should be approximately 97.

I've tested it here http://szeged.github.io/nevada/ and also on a tablet with a dual-core Cortex-A7 ARM CPU. In both cases the result is 243, which is not correct.
How should I implement this to get the correct result of 97?

d2 contains 200,200,200,200,200,200,200,200
d4 contains 255,255,255,255,255,255,255,255
d6 contains 53,53,53,53,53,53,53,53

vsub.s8 d8, d6, d4  (53 - 255 results in d8 = 54,54,54,54,54,54,54,54)
vmull.s8 q5,d8,d2  (54 * 200 results in q5 = 244,48,244,48,244,48,244,48,244,48,244,48,244,48,244,48)
vshrn.s16 d12, q5, #8 (shift right by 8, i.e. divide by 256 as an approximation of dividing by 255, results in d12 = 244,244,244,244,244,244,244,244)
vadd.s8 d5, d4, d12  (final result d5 = 243,243,243,243,243,243,243,243) 
bfalz

1 Answer

243 is absolutely correct.

The alpha channel is an unsigned 8-bit value, so you should use u8 or u16 instead of s8 and s16.

For standard arithmetic where the bit width remains the same, the sign doesn't matter, but it's a completely different story for multiply long.

That's the reason ARM has two separate long-multiply instructions, UMULL and SMULL, while a single MUL instruction will do for both signed and unsigned 32-bit multiplications.
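
To make that concrete, here is a small Python model of the two cases (plain integer arithmetic, not NEON code; the helper name `to_signed8` is made up for illustration). The low 8 bits of the product agree whichever way the bytes are read, but the widened 16-bit product does not:

```python
def to_signed8(byte):
    """Reinterpret an unsigned byte (0..255) as a two's-complement int8."""
    return byte - 256 if byte >= 128 else byte

a, b = 54, 200  # raw register bytes from the question

unsigned_product = a * b                         # how vmull.u8 reads the bytes
signed_product = to_signed8(a) * to_signed8(b)   # how vmull.s8 reads them (200 -> -56)

# Same-width multiply: only the low 8 bits survive, and they agree.
low8_unsigned = unsigned_product & 0xFF
low8_signed = signed_product & 0xFF

# Widening multiply: the full 16-bit results differ.
wide_unsigned = unsigned_product & 0xFFFF
wide_signed = signed_product & 0xFFFF

print(low8_unsigned, low8_signed)   # 48 48
print(wide_unsigned, wide_signed)   # 10800 62512
```

The upper byte is exactly what the later narrowing shift keeps, so this is where the signed and unsigned versions diverge.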

54*200 is simply impossible since 200 is interpreted as -56 in a signed multiply.

=>
54*-56 = -3024
-3024/256 = -12
-12 + -1 = -13    // 255 = -1
-13 = 243

You actually have to change vmull.s8 to vmull.u8:

=>
54*200 = 10800
10800/256 = 42
42 + -1 = 41
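
Both variants can be checked with a rough Python model of one lane of the question's instruction sequence. This is only a sketch of the integer wraparound, not an emulator, and `to_signed` and `blend_lane` are hypothetical names:

```python
def to_signed(value, bits):
    """Reinterpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def blend_lane(alpha, front, back, signed):
    """Model one lane of vsub / vmull / vshrn #8 / vadd from the question."""
    diff = (front - back) & 0xFF                  # vsub: wraps modulo 256
    if signed:
        product = to_signed(diff, 8) * to_signed(alpha, 8)   # vmull.s8
    else:
        product = diff * alpha                                # vmull.u8
    narrowed = ((product & 0xFFFF) >> 8) & 0xFF   # vshrn #8 keeps bits 8..15
    return (back + narrowed) & 0xFF               # vadd: wraps modulo 256

print(blend_lane(200, 53, 255, signed=True))    # 243
print(blend_lane(200, 53, 255, signed=False))   # 41
```

Neither variant gives 97, because the 8-bit subtraction has already wrapped -202 around to 54 before the multiply sees it.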

Honestly, I don't know how you are expecting a result of 97 with the ops above. How is this supposed to be some kind of alpha blending, as one of the tags implies?

Further, >>8 is NOT /255. It divides by 256, which is just a bad approximation. You might think you can live with a precision that low, but it's FAR from sufficient when alpha blending.
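
A quick Python check shows the kind of error the plain shift introduces. The `div255_round` helper below is not from this answer; it is a commonly used shift-and-add trick, included here only as one possible way to get a correctly rounded result without a real division:

```python
def div255_round(x):
    """Rounded x / 255 using only adds and shifts; exact for 0 <= x <= 65025."""
    t = x + 128
    return (t + (t >> 8)) >> 8

x = 255 * 255            # fully opaque white: alpha = 255, colour = 255
print(x >> 8)            # 254: the plain shift loses a level
print(div255_round(x))   # 255

# Count the 8-bit x 8-bit products where the plain shift disagrees with
# a correctly rounded division by 255.
mismatches = sum(
    1
    for a in range(256)
    for c in range(256)
    if (a * c) >> 8 != (a * c + 127) // 255
)
print(mismatches)
```

With the plain shift, a fully opaque white pixel comes out as 254 instead of 255, which is the sort of visible drift that matters in blending.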

You must be doing something wrong.

Jake 'Alquimista' LEE
  • the alpha blending formula I'm trying to implement in neon asm is output_red = ((alpha_front * (red_front - red_bak))/255) + red_bak and repeat for the blue and green pixels. I am expecting a result of 97 when red_front = 53, red_bak = 255, and alpha_front = 200. via a calculator: 53 - 255 = -202, -202 x 200 = -40,400, -40,400/255 = -158 and finally -158 + 255 = 97. I've already tried vmull.u8 and then the result is 41. With regards to the shift right or divide by 255, there are some references for using it here http://www.gamedev.net/topic/34688-alpha-blend-formula/ – bfalz Jul 27 '14 at 03:16
  • That's wrong. The correct formula is: rslt = (alpha*front + (255-alpha)*back)/255. – Jake 'Alquimista' LEE Jul 27 '14 at 03:22
  • (200*53+55*255)/255 = 96. The correct result for the given values is 96, 97 with rounding – Jake 'Alquimista' LEE Jul 27 '14 at 03:28
  • Thanks, and I understand from my research that there are a number of different formulas for alpha blending. I actually implemented the formula ((alpha*(front-back))/255)+back that I listed out in my question in python and it works very well @Jake. I wish I could send you the screenshot of how it looks because it looks very very good. However the implementation is very slow because it is implemented in python. I am trying to now implement this in neon asm. – bfalz Jul 27 '14 at 03:33
  • The formula you are using ignores two things: 1) It tries to reduce the number of multiplies in the hope of being faster. But modern CPUs like ARM nowadays have very fast multiply-accumulate instructions, so it does exactly the opposite. 2) It requires the full bit width of the register. It's simply not appropriate for SIMD, where saving bits increases performance massively. => You should use the formula I gave for NEON. – Jake 'Alquimista' LEE Jul 27 '14 at 03:42
  • Thanks. I just tested your formula in Python and it looks good. I am now porting this over to NEON asm and will update shortly. #Jake's alpha blend formula: rslt = (alpha*front + (255-alpha)*back)/255; output_red = (alpha_front * red_front + (255-alpha_front) * red_bak)/255; output_green = (alpha_front * green_front + (255-alpha_front) * green_bak)/255; output_blue = (alpha_front * blue_front + (255-alpha_front) * blue_bak)/255 – bfalz Jul 27 '14 at 03:53
  • Brilliant @Jake!! Thanks for the tips and pointers, it's working great now. Cheers. vmull.u8 q5,d6,d2; vsub.u8 d8,d4,d2; vmull.u8 q6,d8,d4; vadd.i16 q6,q6,q5; vshrn.u16 d14,q6,#8 – bfalz Jul 27 '14 at 05:02
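
The blend formula from the comments can be sketched in scalar Python, one channel at a time. The function names are illustrative, not from the thread, and `blend_shift` only models the 16-bit sum followed by a shift right by 8; it is not a NEON emulation:

```python
def blend(alpha, front, back):
    """rslt = (alpha*front + (255-alpha)*back) / 255, truncated."""
    return (alpha * front + (255 - alpha) * back) // 255

def blend_shift(alpha, front, back):
    """Same weighted sum, but narrowed with >> 8 instead of / 255."""
    total = alpha * front + (255 - alpha) * back   # at most 255*255, fits in 16 bits
    return (total & 0xFFFF) >> 8

print(blend(200, 53, 255))          # 96
print(blend_shift(200, 53, 255))    # 96
print(blend_shift(255, 255, 255))   # 254
```

The weighted sum never exceeds 255*255 = 65025, so it fits an unsigned 16-bit lane. The intermediate alpha*(front-back) in the original formula can go as low as -65025, which does not fit a signed 16-bit lane.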