Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is a vector-processing instruction set for ARM processors. It's also known as Advanced SIMD (Single Instruction Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.
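Because the two versions differ, code that targets both often needs to know at compile time which one it is being built for. The following is a minimal sketch, assuming a GCC or Clang toolchain, which predefine `__ARM_NEON` (and the older `__ARM_NEON__`) when NEON is available and `__aarch64__` on AArch64; the function name `neon_flavour` is hypothetical.

```c
/* Report which flavour of NEON the compiler is targeting.
   Assumes GCC/Clang predefined macros; other compilers may use different ones. */
static const char *neon_flavour(void)
{
#if defined(__ARM_NEON) && defined(__aarch64__)
    return "AArch64 NEON";   /* 64-bit state: 32 128-bit V registers */
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "AArch32 NEON";   /* 32-bit state: Q registers alias pairs of D registers */
#else
    return "no NEON";        /* not an ARM target, or NEON not enabled */
#endif
}
```

On a 32-bit ARM build, NEON usually has to be enabled explicitly (for example with `-mfpu=neon` on GCC), otherwise the last branch is taken.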

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in, and consider a tag for how you access the instructions.

More information at

  1. Neon page in ARM website
  2. Wikipedia article on ARM
885 questions
5
votes
2 answers

ARM NEON: What's the difference between vld4_f32 and vld4q_f32?

I'm not in a position to make out the difference between vld4_f32 and vld4q_f32 in ARM NEON instructions. The confusion started when I raised my coding levels and started looking at the assembly instructions rather than the less informative…
HaggarTheHorrible
  • 7,083
  • 20
  • 70
  • 81
5
votes
1 answer

How to merge elements of 2 rows using NEON SIMD?

I have a matrix A with rows (a1 a2 a3 a4), (b1 b2 b3 b4), (c1 c2 c3 c4), (d1 d2 d3 d4). I have 2 rows with me: float32x2_t a = a1 a2 and float32x2_t b = b1 b2. From these, how can I get float32x4_t result = b1 a1 b2 a2? Is there any single NEON SIMD instruction…
HaggarTheHorrible
  • 7,083
  • 20
  • 70
  • 81
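One possible way to produce the interleaved layout described in this question is a zip followed by a combine. This is a sketch only, not taken from the question's answers: `interleave_rows` is a hypothetical helper, the inputs are passed as plain arrays for illustration, and a scalar fallback is used when NEON is not available.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Writes b[0], a[0], b[1], a[1] to out. */
static void interleave_rows(float out[4], const float a[2], const float b[2])
{
#if defined(__ARM_NEON)
    float32x2_t va = vld1_f32(a);
    float32x2_t vb = vld1_f32(b);
    /* vzip_f32 yields val[0] = {b[0], a[0]} and val[1] = {b[1], a[1]}. */
    float32x2x2_t zipped = vzip_f32(vb, va);
    /* Join the two 64-bit halves into one 128-bit vector. */
    float32x4_t result = vcombine_f32(zipped.val[0], zipped.val[1]);
    vst1q_f32(out, result);
#else
    /* Scalar fallback with the same element order. */
    out[0] = b[0];
    out[1] = a[0];
    out[2] = b[1];
    out[3] = a[1];
#endif
}
```

Whether the zip compiles to a single instruction depends on the compiler and on whether the target is AArch32 or AArch64.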
5
votes
3 answers

Is 3x3 Matrix inverse possible using SIMD instructions?

I'm using an ARM Cortex-A8 based processor and I have several places where I calculate 3x3 matrix inverses. As the Cortex-A8 processor has a NEON SIMD unit, I'm interested in using this co-processor for 3x3 matrix inversion. I saw…
HaggarTheHorrible
  • 7,083
  • 20
  • 70
  • 81
5
votes
1 answer

Is there a C implementation for GNU ARM NEON intrinsics?

I'm not looking for a portable SIMD implementation. All I need is: a bit-accurate implementation. Performance doesn't matter very much as long as it's not extremely slow. I want to use it for early stage developing and testing, so that I can compile…
user3528438
  • 2,737
  • 2
  • 23
  • 42
5
votes
1 answer

determinant calculation with SIMD

Does there exist an approach for calculating the determinant of matrices with low dimensions (about 4), that works well with SIMD (neon, SSE, SSE2)? I am using a hand-expansion formula, which does not work so well. I am using SSE all the way to…
user1095108
  • 14,119
  • 9
  • 58
  • 116
5
votes
2 answers

Translating SSE to Neon: How to pack and then extract 32bit result

I have to translate the following instructions from SSE to NEON: uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a,SHUFFLE_MASK) ); where: static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1, …
Antonio
  • 19,451
  • 13
  • 99
  • 197
5
votes
0 answers

ARM NEON: Sort an array of 16 bytes

tl;dr: What is the fastest way to sort a uint8x16_t? I need to sort many arrays of exactly 16 unsigned bytes (in descending order, which doesn't matter, of course), and I'm trying to optimize sorting by means of ARM NEON vectorization. And I find…
NoQ
  • 75
  • 6
5
votes
1 answer

Optimizing Cortex-A8 color conversion using NEON

I am currently doing a color conversion routine in order to convert from YUY2 to NV12. I have a function which is quite fast, but not as fast as I would expect, mainly due to cache misses. void convert_hd(uint8_t *orig, uint8_t *result) { uint32_t…
jmh
  • 51
  • 2
5
votes
2 answers

How to use ARM intrinsics in iOS?

I need to compute the MSB (most significant bit) of millions of 32-bit integers on an iPad very fast. I have my own (ugly) implementation of MSB written in plain C, which is slow. ARM processors have a CLZ (count leading zeros) hardware instruction, which can…
Alexander Vasenin
  • 11,437
  • 4
  • 42
  • 70
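The relationship between CLZ and the MSB mentioned in this question can be sketched as follows. This assumes a GCC or Clang compiler, whose `__builtin_clz` builtin typically compiles to the CLZ instruction on ARM; `msb_index` is a hypothetical name, and this is scalar code rather than NEON.

```c
#include <stdint.h>

/* Index (0..31) of the most significant set bit, or -1 when x is 0.
   __builtin_clz is undefined for 0, hence the explicit guard. */
static inline int msb_index(uint32_t x)
{
    if (x == 0)
        return -1;
    return 31 - __builtin_clz(x);
}
```

For example, `msb_index(0x00010000u)` has 15 leading zeros, so the result is 16.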
5
votes
3 answers

Resize 8-bit image by 2 with ARM NEON

I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image: void reducebytwo(uint8_t *dst, uint8_t *src) //src is 640x480, dst is 320x240 What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere? As a…
gregoiregentil
  • 1,793
  • 1
  • 26
  • 56
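One common approach to this kind of 2x reduction is to average each 2x2 block of source pixels. The sketch below is an illustration under stated assumptions, not the accepted answer: `halve_image` is a hypothetical function, it takes the dimensions as parameters, assumes even width and height and tightly packed rows, and falls back to scalar code when NEON is unavailable or for leftover columns.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Shrink a width x height 8-bit image by 2 in each dimension by
   averaging every 2x2 block with rounding. */
static void halve_image(uint8_t *dst, const uint8_t *src,
                        size_t width, size_t height)
{
    for (size_t y = 0; y + 2 <= height; y += 2) {
        const uint8_t *r0 = src + y * width;        /* upper source row */
        const uint8_t *r1 = r0 + width;             /* lower source row */
        uint8_t *out = dst + (y / 2) * (width / 2);
        size_t x = 0;
#if defined(__ARM_NEON)
        for (; x + 16 <= width; x += 16) {
            /* Add horizontal neighbours, widening 8-bit to 16-bit. */
            uint16x8_t sum = vpaddlq_u8(vld1q_u8(r0 + x));
            /* Accumulate the pairwise sums of the lower row. */
            sum = vpadalq_u8(sum, vld1q_u8(r1 + x));
            /* Rounding shift right by 2 and narrow back to 8-bit. */
            vst1_u8(out + x / 2, vrshrn_n_u16(sum, 2));
        }
#endif
        /* Scalar path: same rounding as the vector path. */
        for (; x + 2 <= width; x += 2) {
            unsigned sum = r0[x] + r0[x + 1] + r1[x] + r1[x + 1];
            out[x / 2] = (uint8_t)((sum + 2) >> 2);
        }
    }
}
```

For the 640x480 case in the question, the call would be `halve_image(dst, src, 640, 480)`.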
5
votes
3 answers

Is NEON of ARM faster for integers than floating points?

Or do floating-point and integer operations run at the same speed? If not, how much faster is the integer version?
MetallicPriest
  • 29,191
  • 52
  • 200
  • 356
5
votes
2 answers

ARM NEON SIMD version 2

What is the difference between NEON SIMD and NEON SIMD version 2 as in Cortex A15?
user1511956
  • 784
  • 3
  • 9
  • 22
5
votes
2 answers

ARM NEON vectorization failure

I would like to enable NEON vectorization on my ARM Cortex-A9, but I get this output at compile time: "not vectorized: relevant stmt not supported: D.14140_82 = D.14143_77 * D.14141_81" Here is my loop: void my_mul(float32_t * __restrict data1, float32_t…
user2092113
  • 103
  • 1
  • 5
5
votes
2 answers

ARM NEON assembly on Windows Phone 8 not working

I'm trying to call a function that is coded in ARM NEON assembly in an .s file that looks like this: AREA myfunction, code, readonly, ARM global fun align 4 fun push {r4, r5, r6, r7, lr} add r7, sp, #12 push {r8, r10, r11} sub r4,…
Anthony Blake
  • 5,328
  • 2
  • 25
  • 24
5
votes
2 answers

Using ARM NEON intrinsics to add alpha and permute

I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components? void neonPermuteRGBtoBGRA(unsigned char* src,…
Nick Lee
  • 127
  • 2
  • 9