Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is also known as Advanced SIMD (Single Instruction, Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in and, where relevant, a tag for how you access the instructions.

More information at

  1. Neon page on the ARM website
  2. Wikipedia article on ARM
885 questions
7
votes
3 answers

How can I optimize a looped 4D matrix-vector-multiplication with ARM NEON?

I'm working on optimizing a 4D (128 Bit) matrix-vector multiplication using ARM NEON Assembler. If I load the matrix, and the vector into the NEON Registers and transform it, I won't get a great performance boost, because the switch to the NEON…
oc1d
  • 233
  • 1
  • 3
  • 9
7
votes
1 answer

ARM and NEON can work in parallel?

This is with reference to the question: Checksum code implementation for Neon in Intrinsics. I am opening the sub-questions listed in the link as separate individual questions, as multiple questions aren't to be asked as part of a single thread. Anyway, coming…
nguns
  • 440
  • 6
  • 21
7
votes
2 answers

On iOS how to quickly convert RGB24 to BGR24?

I use vImageConvert_RGB888toPlanar8 and vImageConvert_Planar8toRGB888 from Accelerate.framework to convert RGB24 to BGR24, but when the data to transform is very big, such as 3M or 4M, the time spent on this is about 10ms. So someone…
zhzhy
  • 461
  • 3
  • 17
6
votes
2 answers

Converting between SSE and NEON Intrinsics-Shuffling

I am trying to convert code written in SSE3 intrinsics to NEON SIMD and am stuck because of a shuffle function. I have looked at the GCC intrinsics, ARM manuals and other forums but have not been able to find a solution. CODE: _m128i upper =…
Rahul
  • 115
  • 2
  • 7
6
votes
3 answers

Problems with Qualcomm Scorpion dual-core ARM NEON code?

I am developing a native library for Android where I use ARM assembly optimizations and multithreading in order to get maximum performance on the dual-core ARM chipset MSM8660. While doing some measurements I noticed the following: The…
Leo
  • 2,328
  • 2
  • 21
  • 41
6
votes
2 answers

Why does gcc, with -O3, unnecessarily clear a local ARM NEON array?

Consider the following code (Compiler Explorer link), compiled under gcc and clang with -O3 optimization: #include <arm_neon.h> void bug(int8_t *out, const int8_t *in) { for (int i = 0; i < 2; i++) { int8x16x4_t x; x.val[0] =…
swineone
  • 2,296
  • 1
  • 18
  • 32
6
votes
3 answers

Sum all elements in a quadword vector in ARM assembly with NEON

I'm rather new to assembly, and although the ARM information center is often helpful, sometimes the instructions can be a little confusing to a newbie. Basically what I need to do is sum 4 float values in a quadword register and store the result in a…
A Person
  • 801
  • 1
  • 10
  • 22
6
votes
1 answer

Mixing NEON assembly with non-vector functions

I think I found the answer to my question. There is an "fmacs" instruction for VFP which may do the trick which does scalar computation on NEON/VFP registers. I'm very new to NEON or ARM programming... I want to load up an upper triangular matrix…
paul
  • 257
  • 4
  • 13
6
votes
3 answers

ARM GCC bug? Uses chains of vldr instead of one vldmia…

Consider the following NEON-optimized function: void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) { // Make sure "a" is mapped to registers in the d0-d15 range, // as requested by NEON multiply…
jcayzac
  • 1,441
  • 1
  • 13
  • 26
6
votes
1 answer

Short to Float and viceversa conversion using NEON SIMD

I am processing audio buffers in Android, the setup I have is as follows: get system callback with a short buffer convert short buffer to float buffer do some DSP with float buffer convert float buffer to short buffer deliver short buffer to…
alexm
  • 1,285
  • 18
  • 36
6
votes
0 answers

_mm_cmpestri instruction alternative on NEON

I am trying to run PicoHTTPParser on an ARM platform. There is a function that uses the SSE instruction _mm_cmpestri for fast char comparison. Is there an alternative in NEON? It seems I'll have to use something like VCGT.U8, but it does not look very…
maxlovic
  • 67
  • 5
6
votes
2 answers

How portable are the new ARM SVE instructions?

I am looking for information about the new Scalable Vector Extension (SVE) from Arm. It looks amazingly good to me for image processing, being able to compute 2048 bits in parallel and so on. But I'm not sure if it will be running on every…
6
votes
1 answer

Intel / ARM intrinsics equivalence

I have a C application using Intel intrinsics like: __m128 _mm_add_ps (__m128 a, __m128 b) __m128 _mm_sub_ps (__m128 a, __m128 b) __m128 _mm_mul_ps (__m128 a, __m128 b) __m128 _mm_set_ps (float e3, float e2, float e1, float e0) void _mm_store_ps…
A.nechi
  • 521
  • 1
  • 5
  • 15
6
votes
5 answers

Load 8bit uint8_t as uint32_t?

My image processing project works with grayscale images. I have an ARM Cortex-A8 processor platform. I want to make use of NEON. I have a grayscale image (consider the example below) and in my algorithm, I have to add only the columns. How can I…
HaggarTheHorrible
  • 7,083
  • 20
  • 70
  • 81
6
votes
1 answer

Fast search/replace of matching single bytes in a 8-bit array, on ARM

I develop image processing algorithms (using GCC, targeting ARMv7 (Raspberry Pi 2B)). In particular I use a simple algorithm, which changes index in a mask: void ChangeIndex(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex) { …
user5480694