Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

It is also known as Advanced SIMD (Single Instruction, Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in, perhaps [assembly], [c] or [c++]. In the latter two cases, consider the tags [intrinsics] or [inline-assembly] to indicate how you access the instructions.
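To illustrate what "accessing the instructions through intrinsics" looks like in C, here is a minimal sketch, not a definitive implementation. It assumes a compiler that provides `<arm_neon.h>` and defines `__ARM_NEON` when NEON is available. The helper name `add_f32` is hypothetical and chosen only for this example. On targets without NEON the code falls back to a plain scalar loop.

```c
#include <stddef.h>

/* Assumption: the compiler defines __ARM_NEON (or the older __ARM_NEON__)
   when NEON intrinsics are usable on the target. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* add_f32 is a hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* Process four floats per iteration using 128-bit (q) registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add lanes and store */
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

A question about code like this would typically carry [neon], [intrinsics], [c], and either [arm] or [arm64], depending on the target.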

885 questions

2 answers

ARM NEON: What's the difference between vld4_f32 and vld4q_f32?

I'm not in a position to make out the difference between vld4_f32 and vld4q_f32 in ARM NEON instructions. The confusion started when I raised my coding levels and started looking at the assembly instructions rather than the less informative…

memory assembly arm neon cpu-registers

asked Sep 29 '10 at 08:07

HaggarTheHorrible

1 answer

How to merge elements of 2 rows using NEON SIMD?

I have A = a1 a2 a3 a4 b1 b2 b3 b4 c1 c2 c3 c4 d1 d2 d3 d4. I have 2 rows: float32x2_t a = a1 a2 and float32x2_t b = b1 b2. From these, how can I get float32x4_t result = b1 a1 b2 a2? Is there any single NEON SIMD instruction…

simd intrinsics neon

asked Jul 27 '10 at 11:59

HaggarTheHorrible

3 answers

Is 3x3 Matrix inverse possible using SIMD instructions?

I'm using an ARM Cortex-A8 based processor and I have several places where I calculate 3x3 matrix inverses. As the Cortex-A8 processor has a NEON SIMD unit, I'm interested in using this co-processor for 3x3 matrix inversion. I saw…

algorithm simd neon matrix-inverse

asked Jul 26 '10 at 10:58

HaggarTheHorrible

1 answer

Is there a C implementation for GNU ARM NEON intrinsics?

I'm not looking for a portable SIMD implementation. All I need is: a bit-accurate implementation. Performance doesn't matter very much as long as it's not extremely slow. I want to use it for early stage developing and testing, so that I can compile…

c gcc arm simd neon

asked Sep 17 '15 at 13:37

user3528438

1 answer

determinant calculation with SIMD

Does there exist an approach for calculating the determinant of matrices with low dimensions (about 4), that works well with SIMD (neon, SSE, SSE2)? I am using a hand-expansion formula, which does not work so well. I am using SSE all the way to…

sse simd neon determinants

asked May 01 '15 at 16:58

user1095108

2 answers

Translating SSE to Neon: How to pack and then extract 32bit result

I have to translate the following instructions from SSE to Neon uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a,SHUFFLE_MASK) ); Where: static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1, …

c++ arm sse neon intrinsics

asked Mar 20 '15 at 13:29

Antonio

0 answers

ARM NEON: Sort an array of 16 bytes

tl;dr: What is the fastest way to sort a uint8x16_t? I need to sort many arrays of exactly 16 unsigned bytes (in descending order, which doesn't matter, of course), and I'm trying to optimize sorting by means of ARM NEON vectorization. And I find…

arrays sorting assembly arm neon

asked Feb 24 '14 at 15:54

NoQ

1 answer

Optimizing Cortex-A8 color conversion using NEON

I am currently doing a color conversion routine in order to convert from YUY2 to NV12. I have a function which is quite fast, but not as fast as I would expect, mainly due to cache misses. void convert_hd(uint8_t *orig, uint8_t *result) { uint32_t…

assembly arm neon cpu-cache cortex-a8

asked Feb 06 '14 at 13:52

jmh

2 answers

How to use ARM intrinsics in iOS?

I need to compute the MSB (most significant bit) of millions of 32-bit integers on an iPad very fast. I have my own (ugly) implementation of MSB written in plain C, which is slow. ARM processors have a CLZ (count leading zeroes) hardware instruction, which can…

ios arm neon intrinsics

asked Oct 29 '13 at 17:28

Alexander Vasenin

3 answers

Resize 8-bit image by 2 with ARM NEON

I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image: void reducebytwo(uint8_t *dst, uint8_t *src) //src is 640x480, dst is 320x240 What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere? As a…

image image-processing arm simd neon

asked Jul 23 '13 at 16:34

gregoiregentil

3 answers

Is NEON of ARM faster for integers than floating points?

Or are floating-point and integer operations the same speed? If not, how much faster is the integer version?

c arm neon

asked May 31 '13 at 10:47

MetallicPriest

2 answers

ARM NEON SIMD version 2

What is the difference between NEON SIMD and NEON SIMD version 2, as in the Cortex-A15?

arm simd neon

asked Mar 05 '13 at 15:11

user1511956

2 answers

ARM NEON vectorization failure

I would like to enable NEON vectorization on my ARM Cortex-A9, but I get this output at compile time: "not vectorized: relevant stmt not supported: D.14140_82 = D.14143_77 * D.14141_81" Here is my loop: void my_mul(float32_t * __restrict data1, float32_t…

compiler-construction arm vectorization neon

asked Mar 05 '13 at 13:50

user2092113

2 answers

ARM NEON assembly on Windows Phone 8 not working

I'm trying to call a function that is coded in ARM NEON assembly in an .s file that looks like this: AREA myfunction, code, readonly, ARM global fun align 4 fun push {r4, r5, r6, r7, lr} add r7, sp, #12 push {r8, r10, r11} sub r4,…

windows visual-studio arm windows-phone-8 neon

asked Nov 21 '12 at 07:52

Anthony Blake

2 answers

Using ARM NEON intrinsics to add alpha and permute

I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components? void neonPermuteRGBtoBGRA(unsigned char* src,…

arm neon intrinsics cortex-a8

asked Aug 09 '12 at 19:56

Nick Lee
