Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is also known as Advanced SIMD (Single Instruction, Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in, perhaps [assembly], [c] or [c++]. In the latter cases, consider the tags [intrinsics] or [inline-assembly] for how you access the instructions.
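As a rough illustration of what accessing NEON through intrinsics looks like in C, here is a minimal sketch. The helper name `add_f32` is hypothetical and chosen only for this example; the intrinsics `vld1q_f32`, `vaddq_f32` and `vst1q_f32` come from `<arm_neon.h>`. The code assumes the compiler defines `__ARM_NEON` (or the older `__ARM_NEON__`) when NEON is available and falls back to a plain scalar loop otherwise, so it should build on non-ARM targets too.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>   /* NEON intrinsics; same header on AArch32 and AArch64 */
#define EXAMPLE_HAVE_NEON 1   /* hypothetical macro, local to this sketch */
#endif

/* Hypothetical helper: element-wise dst[i] = a[i] + b[i] for n floats. */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#ifdef EXAMPLE_HAVE_NEON
    /* Process four floats per iteration using 128-bit vectors. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add lanes and store  */
    }
#endif
    /* Scalar tail; this is also the whole loop when NEON is unavailable. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

A question about code like this would typically carry the [c] and [intrinsics] tags alongside [neon], plus [arm] or [arm64] depending on the target.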


885 questions


3 answers

How can I optimize a looped 4D matrix-vector-multiplication with ARM NEON?

I'm working on optimizing a 4D (128-bit) matrix-vector multiplication using ARM NEON assembler. If I load the matrix and the vector into the NEON registers and transform it, I won't get a great performance boost, because the switch to the NEON…

android c android-ndk arm neon

asked Oct 19 '12 at 14:36

oc1d


1 answer

ARM and NEON can work in parallel?

This is with reference to the question "Checksum code implementation for Neon in Intrinsics". I am opening the sub-questions listed in the link as separate individual questions, as multiple questions aren't to be asked as part of a single thread. Anyway, coming…

arm inline-assembly simd neon cortex-a8

asked Sep 05 '12 at 08:37

nguns


2 answers

On iOS how to quickly convert RGB24 to BGR24?

I use vImageConvert_RGB888toPlanar8 and vImageConvert_Planar8toRGB888 from Accelerate.framework to convert RGB24 to BGR24, but when the data to transform is very large, such as 3M or 4M, the conversion takes about 10 ms. So someone…

ios assembly rgb neon accelerate-framework

asked Jul 27 '12 at 08:07

zhzhy


2 answers

Converting between SSE and NEON Intrinsics-Shuffling

I am trying to convert code written with SSE3 intrinsics to NEON SIMD and am stuck because of a shuffle function. I have looked at the GCC intrinsics, ARM manuals and other forums but have not been able to find a solution. CODE: __m128i upper =…

sse shuffle neon intrinsics

asked Nov 01 '11 at 03:25

Rahul


3 answers

Problems with Qualcomm Scorpion dual-core ARM NEON code?

I am developing a native library for Android where I use ARM assembly optimizations and multithreading in order to get maximum performance on the dual-core ARM chipset MSM8660. While doing some measurements I noticed the following: The…

performance assembly arm multicore neon

asked Sep 29 '11 at 11:54

Leo


2 answers

Why does gcc, with -O3, unnecessarily clear a local ARM NEON array?

Consider the following code (Compiler Explorer link), compiled under gcc and clang with -O3 optimization: #include <arm_neon.h> void bug(int8_t *out, const int8_t *in) { for (int i = 0; i < 2; i++) { int8x16x4_t x; x.val[0] =…

c gcc arm64 neon compiler-bug

asked Oct 07 '21 at 22:30

swineone


3 answers

Sum all elements in a quadword vector in ARM assembly with NEON

I'm rather new to assembly, and although the ARM Information Center is often helpful, sometimes the instructions can be a little confusing to a newbie. Basically what I need to do is sum 4 float values in a quadword register and store the result in a…

math assembly arm neon

asked Aug 03 '11 at 18:17

A Person


1 answer

Mixing NEON assembly with non-vector functions

I think I found the answer to my question. There is an "fmacs" instruction for VFP, which does scalar computation on NEON/VFP registers and may do the trick. I'm very new to NEON and ARM programming... I want to load up an upper triangular matrix…

assembly arm neon

asked May 16 '11 at 19:01

paul


3 answers

ARM GCC bug? Uses chains of vldr instead of one vldmia…

Consider the following NEON-optimized function: void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) { // Make sure "a" is mapped to registers in the d0-d15 range, // as requested by NEON multiply…

gcc assembly arm neon

asked Dec 24 '10 at 08:59

jcayzac


1 answer

Short to Float and vice versa conversion using NEON SIMD

I am processing audio buffers in Android. The setup I have is as follows: get system callback with a short buffer; convert short buffer to float buffer; do some DSP with float buffer; convert float buffer to short buffer; deliver short buffer to…

android arm simd neon

asked May 17 '17 at 21:30

alexm


0 answers

_mm_cmpestri instruction alternative on NEON

I am trying to run PicoHTTPParser on an ARM platform. There is a function that uses the SSE instruction _mm_cmpestri for fast char comparison. Is there an alternative in NEON? It seems I'll have to use something like VCGT.U8, but it does not look very…

c performance string-comparison neon

asked Dec 22 '16 at 08:01

maxlovic


2 answers

How portable are the new ARM SVE instructions?

I am looking for information about the new Scalable Vector Extension (SVE) from Arm. It looks amazingly good to me for image processing, being able to compute on 2048 bits in parallel and so on. But I'm not sure if it will be running on every…

arm neon arm64 sve

asked Dec 21 '16 at 13:04

Felix Yah Batta Man


1 answer

Intel / ARM intrinsics equivalence

I have a C application using Intel intrinsics like: __m128 _mm_add_ps (__m128 a, __m128 b); __m128 _mm_sub_ps (__m128 a, __m128 b); __m128 _mm_mul_ps (__m128 a, __m128 b); __m128 _mm_set_ps (float e3, float e2, float e1, float e0); void _mm_store_ps…

c arm intrinsics neon gem5

asked Aug 12 '16 at 13:50

A.nechi


5 answers

Load 8bit uint8_t as uint32_t?

My image processing project works with grayscale images. I have an ARM Cortex-A8 processor platform and want to make use of NEON. I have a grayscale image (consider the example below) and in my algorithm, I have to add only the columns. How can I…

arm neon intrinsics cortex-a

asked Sep 09 '10 at 09:58

HaggarTheHorrible


1 answer

Fast search/replace of matching single bytes in a 8-bit array, on ARM

I develop image processing algorithms (using GCC, targeting ARMv7 (Raspberry Pi 2B)). In particular I use a simple algorithm, which changes index in a mask: void ChangeIndex(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex) { …

c++ image-processing arm simd neon

asked Jan 28 '16 at 08:14

user5480694
