Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is also known as Advanced SIMD (Single Instruction, Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, the AArch32 and AArch64 versions of NEON differ significantly (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in, perhaps [assembly], [c] or [c++]. In the latter cases, consider the tags [inline-assembly] or [intrinsics] for how you access the instructions.
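As a rough illustration of the intrinsics route, here is a minimal sketch in C that multiplies two float arrays element by element. The function name `mul_f32` and the `HAVE_NEON` macro are hypothetical, chosen for this example only. The code assumes the compiler defines `__ARM_NEON` (or the older `__ARM_NEON__`) when NEON is enabled, and it falls back to plain scalar code otherwise, so it should build on non-ARM hosts as well.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the compiler defines one of these macros when NEON is enabled. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] * b[i] for every i in [0, n). */
void mul_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    size_t i = 0;
#if HAVE_NEON
    /* Process four floats per iteration using 128-bit vectors. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vmulq_f32(va, vb));
    }
#endif
    /* Scalar tail; this is the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

The intrinsics used here (`vld1q_f32`, `vmulq_f32`, `vst1q_f32`) are available for both AArch32 and AArch64. On 32-bit ARM, GCC-style compilers typically need an option such as `-mfpu=neon` before the NEON path is compiled in.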

More information at

  1. NEON page on the ARM website
  2. Wikipedia article on ARM
885 questions
6
votes
2 answers

How to enable Neon instruction in Xcode

I want to use NEON SIMD instructions on the iPhone. I heard we have to put the flags "-mfloat-abi=softfp -mfpu=neon" in the "Other C Flags" field of the Target inspector, but when building I get "error: unrecognized command line option "-mfpu=neon""…
Krav
  • 61
  • 1
  • 2
6
votes
2 answers

Fastest Inverse Square Root on iPhone

I'm working on an iPhone app that involves certain physics calculations that are done thousands of times per second. I am working on optimizing the code to improve the framerate. One of the pieces that I am looking at improving is the inverse…
WolfLink
  • 3,308
  • 2
  • 26
  • 44
6
votes
2 answers

LSB to MSB bit reversal on ARM

I need to reverse a YUV image with each byte in LSB instead of MSB. I have read Best Algorithm for Bit Reversal ( from MSB->LSB to LSB->MSB) in C but I would like to do something that is ARM-optimized. int8 *image; for(i = 0; i < size; i++) { …
gregoiregentil
  • 1,793
  • 1
  • 26
  • 56
6
votes
1 answer

Constant out of range with NEON intrinsics

I'm compiling the following ARM NEON intrinsics test code (in Eclipse with Android NDK): void foo(uint64_t* Res) { uint64_t x = 0xff12aa8902acf78dULL; uint64x1_t a,b; a = vld1_u64 (&x); b = vext_u64 (a, a, 3); vst1_u64…
NumberFour
  • 3,551
  • 8
  • 48
  • 72
6
votes
2 answers

neon float multiplication is slower than expected

I have two tabs of floats. I need to multiply elements from the first tab by corresponding elements from the second tab and store the result in a third tab. I would like to use NEON to parallelize floats multiplications: four float multiplications…
tomto
  • 73
  • 1
  • 5
6
votes
3 answers

Add all elements in a lane

Is there an intrinsic which allows one to add all of the elements in a lane? I am using Neon to multiply 8 numbers together, and I need to sum the result. Here is some paraphrased code to show what I'm currently doing (this could probably be…
NOP
  • 864
  • 1
  • 12
  • 26
6
votes
4 answers

Efficient floating point comparison (Cortex-A8)

There is a big (~100 000) array of floating point variables, and there is a threshold (also floating point). The problem is that I have to compare each one variable from the array with a threshold, but NEON flags transfer takes a really long time…
Alex
  • 9,891
  • 11
  • 53
  • 87
5
votes
1 answer

ARM NEON: comparing 128 bit values

I'm interested in finding the fastest way (lowest cycle count) of comparing the values stored into NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed). So far I have the following: (1) Using the VFP floating point…
Mircea
  • 1,841
  • 15
  • 18
5
votes
2 answers

ARM Cortex A8 Benchmarks: can someone help me make sense of these numbers?

I'm working on writing several real-time DSP algorithms on Android, so I decided to program the ARM directly in Assembly to optimize everything as much as possible and make the math maximally lightweight. At first I was getting speed benchmarks that…
Phonon
  • 12,549
  • 13
  • 64
  • 114
5
votes
3 answers

Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute value and index)

I want to see if it's possible to write some generic SIMD code that can compile efficiently. Mostly for SSE, AVX, and NEON. A simplified version of the problem is: Find the maximum absolute value of an array of floating point numbers and return…
TrentP
  • 4,240
  • 24
  • 35
5
votes
1 answer

uint8 to float using SIMD Neon intrinsics

I'm trying to optimize my code that converts grayscale images to float images which runs on Neon A64/v8. The current implementation is quite fast using OpenCV's convertTo() (that compiled for android), but this is still our bottleneck. So I came up…
Chen
  • 51
  • 2
5
votes
3 answers

Neon Optimization using intrinsics

Learning about ARM NEON intrinsics, I was timing a function that I wrote to double the elements in an array. The version that used the intrinsics takes more time than a plain C version of the function. Without NEON : void …
itisravi
  • 3,406
  • 3
  • 23
  • 30
5
votes
3 answers

Is numpy optimized for raspberry-pi automatically

The Raspberry Pi ( armv7l architecture ) has neon vfpv4 support which can be used for optimization. Does the standard version of numpy include these optimizations when installing with the command pip3 install numpy or apt-get python3-numpy? I am not…
Dan Erez
  • 1,364
  • 15
  • 16
5
votes
1 answer

What exact difference is between NEON and SIMD instructions in cortex M7

As per my understanding by referring to many links to ARM's site I understand Cortex-M7 doesn't support NEON instructions, but the host (CORTEX-M7) processor that we are using in our organization specifies "ARM Cortex-M7 with single precision…
5
votes
0 answers

Hardware optimizations using Qualcomm Snapdragon 800 and Adreno 330

I am developing a real-time computer vision project that runs on an Ubuntu (Linaro) board with an ARM CPU (Snapdragon 800). Some parts of the software operate on HD images, huge amount of data. This slows the execution and acts as a…
avi123
  • 51
  • 3