Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is also known as Advanced SIMD (Single Instruction, Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.
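As one illustration of the instruction-availability difference, the sketch below sums the four lanes of a vector. It assumes a compiler that provides `arm_neon.h` and the `__ARM_NEON` / `__aarch64__` predefined macros; `sum4_u32` is a hypothetical helper name, and the scalar branch exists only so the sketch also builds on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Sum the four 32-bit values in v (hypothetical helper for illustration). */
uint32_t sum4_u32(const uint32_t v[4])
{
    assert(v != NULL);
#if SKETCH_HAVE_NEON
    uint32x4_t q = vld1q_u32(v);
#if defined(__aarch64__)
    /* AArch64 only: a single across-vector add. */
    return vaddvq_u32(q);
#else
    /* AArch32: no across-vector add, so reduce with a pairwise add. */
    uint32x2_t d = vadd_u32(vget_low_u32(q), vget_high_u32(q));
    d = vpadd_u32(d, d);
    return vget_lane_u32(d, 0);
#endif
#else
    /* Portable fallback for hosts without NEON. */
    return v[0] + v[1] + v[2] + v[3];
#endif
}
```

The same source compiles for both architectures, but the AArch64 branch uses an intrinsic that has no AArch32 counterpart, which is why questions benefit from stating the target.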

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in. Where relevant, also consider a tag for how you access the instructions.

More information:

  1. Neon page on the ARM website
  2. Wikipedia article on ARM
885 questions
5
votes
2 answers

Optimization using NEON assembly

I am trying to optimize some parts of OpenCV code using NEON. Here is the original code block I work on. (Note: If it is of any importance, you can find the full source at "opencvfolder/modules/video/src/lkpyramid.cpp". It is an implementation of an…
akaya
5
votes
1 answer

Fast ARM NEON memcpy

I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy per line. for(i = 0; i < h; i++) { memcpy(d, s, w); s += sp; d += dp; } I know that the following d, dp, s, sp, w are all 32-byte aligned, so my next (still…
robbie_c
4
votes
2 answers

ARM NEON debugging for Android NDK

The NDK (Android Native Development Kit) for ARM comes with a gcc and GNU utils toolchain, including an elderly GDB. However, the GDB seems unable to show the contents of registers in the VFP or NEON SIMD extensions - that is, in debugging a program…
grrussel
4
votes
2 answers

fast comparison of arrays in iOS

I need to move a small 2D array of values around a much larger 2D array of values, and set any values of the larger array that are greater than the corresponding values in the smaller array to the values of the smaller array. Think image…
Davido
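
The operation described in this excerpt amounts to an element-wise minimum. A minimal sketch for one row follows, assuming 8-bit unsigned values (the excerpt does not state the element type); `clamp_row_u8`, `large`, and `patch` are hypothetical names, and the scalar loop handles both the tail and non-NEON hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* For i in [0, n): large[i] = min(large[i], patch[i]).
 * Element type uint8_t is an assumption made for this sketch. */
void clamp_row_u8(uint8_t *large, const uint8_t *patch, size_t n)
{
    size_t i = 0;
    assert(large != NULL && patch != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process 16 bytes per iteration with an unsigned byte-wise minimum. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t a = vld1q_u8(large + i);
        uint8x16_t b = vld1q_u8(patch + i);
        vst1q_u8(large + i, vminq_u8(a, b));
    }
#endif
    /* Scalar tail (and the whole row when NEON is unavailable). */
    for (; i < n; ++i) {
        if (large[i] > patch[i])
            large[i] = patch[i];
    }
}
```

For a 2D region, the function would be called once per row, with each pointer advanced by its own row stride.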
4
votes
2 answers

Fast Pixel Count on Binary Image- ARM neon intrinsics - iOS Dev

Can someone tell me a fast function to count the number of white pixels in a binary image? I need it for iOS app development. I am working directly on the memory of the image, defined as bool *imageData = (bool *) malloc(noOfPixels * sizeof(bool)); I am…
shreyas253
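
One possible way to approach the counting problem in this excerpt is sketched below. It assumes each pixel occupies one byte holding exactly 0 or 1 (that is, `sizeof(bool) == 1`, which the excerpt does not guarantee); `count_white_pixels` is a hypothetical name. The NEON path sums bytes in 8-bit lanes for at most 255 blocks at a time, so no lane can overflow, and then widens with pairwise adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Count bytes equal to 1 in px[0..n). Assumes every byte is 0 or 1. */
size_t count_white_pixels(const uint8_t *px, size_t n)
{
    size_t total = 0;
    size_t i = 0;
    assert(px != NULL || n == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    while (i + 16 <= n) {
        /* Each 8-bit lane gains at most 1 per block, so cap at 255 blocks. */
        size_t blocks = (n - i) / 16;
        if (blocks > 255)
            blocks = 255;
        uint8x16_t acc = vdupq_n_u8(0);
        for (size_t b = 0; b < blocks; ++b, i += 16)
            acc = vaddq_u8(acc, vld1q_u8(px + i));
        /* Widen 8 -> 16 -> 32 -> 64 bits, then add the two 64-bit lanes. */
        uint64x2_t wide = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(acc)));
        total += (size_t)(vgetq_lane_u64(wide, 0) + vgetq_lane_u64(wide, 1));
    }
#endif
    /* Scalar tail (and the whole buffer when NEON is unavailable). */
    for (; i < n; ++i)
        total += px[i];
    return total;
}
```

If the buffer can hold values other than 0 and 1, the bytes would need normalizing first; that case is not handled here.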
4
votes
4 answers

ARM NEON simple low pass filter vectorization

I have a simple single-pole low-pass filter (for parameter smoothing) that can be described by the following formula: y[n] = (1-a) * y[n-1] + a * x[n] How can I effectively vectorize this on ARM NEON using intrinsics? Is it possible? The problem…
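
The recurrence in this excerpt can be written as a plain scalar loop, which is useful as a reference when checking any vectorized attempt. This is only a sketch: `lowpass_scalar` and its parameter names are hypothetical, and it does not attempt the vectorization the question asks about. Each output depends on the previous one, and that loop-carried dependency is what makes the case awkward for SIMD.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for y[n] = (1 - a) * y[n-1] + a * x[n].
 * y_prev is the filter state before the first sample; the updated
 * state is returned so that consecutive blocks can be chained. */
float lowpass_scalar(float *y, const float *x, size_t n, float a, float y_prev)
{
    assert((y != NULL && x != NULL) || n == 0);
    for (size_t i = 0; i < n; ++i) {
        /* Loop-carried dependency: this step needs the previous output. */
        y_prev = (1.0f - a) * y_prev + a * x[i];
        y[i] = y_prev;
    }
    return y_prev;
}
```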
4
votes
2 answers

NEON vectorize sum of products of unsigned bytes: (a[i]-int1) * (b[i]-int2)

I need to improve a loop, because it is called by my application thousands of times. I suppose I need to do it with NEON, but I don't know where to begin. Assumptions / pre-conditions: w is always 320 (multiple of 16/32). pa and pb are 16-byte…
Gustavo
4
votes
3 answers

ARM NEON: Which pairs of instructions have to wait for write back?

In the ARM NEON documentation, it says: [...] some pairs of instructions might have to wait until the value is written back to the register file. I haven't come across a list that defines the instruction pairs that can use forwarded results and…
Anthony Blake
4
votes
1 answer

Is there a way to treat the register file as an array in ARMv8 (scalar or Neon)?

Suppose I have a short array v of say 8 int64_t. I have an algorithm that needs to access different elements of that array, which are not compile-time constants, e.g. something like v[(i + j)/2] += ... in which i and j are variables not subject to…
swineone
4
votes
3 answers

Aliasing of NEON vector data types

Does NEON support aliasing of the vector data types with their scalar components? E.g. (Intel SSE) typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__)); The above will allow me to do: __m128i* somePtr; somePtr++; //advance…
celavek
4
votes
3 answers

fast bit-matrix (64x64) transpose algorithm using SIMD (ARM)

I am trying to understand whether there is a fast way to do a matrix transpose (64x64 bits) using ARM SIMD instructions. I tried to explore the VTRN instruction of ARM SIMD but am not sure of its effective application in this scenario. The input matrix…
sourabh jaiswal
4
votes
2 answers

Transposing 8x8 float matrix using NEON intrinsics

I have a program that needs to run a transpose operation on 8x8 float32 matrices many times. I want to transpose these using NEON SIMD intrinsics. I know that the array will always contain 8x8 float elements. I have a baseline non-intrinsic solution…
bickit
4
votes
3 answers

NEON Assembly manual / tutorial with GNU assembler

Are there any resources that would cover syntax of using NEON Assembly with GNU assembler? I've read that syntax differs from the one using RVCT assembler, but that's the only thing I can find documentation for. Are there any good resources out…
Phonon
4
votes
2 answers

How to extend a int32x2_t to a int32x4_t with NEON intrinsics on clang/AArch64 when you don't care about the new lanes?

Fellow ARMists, I'd like to narrow and saturate 2 s32 to 2 s16 with NEON code, and pack them in a GPR. I need to conform to a certain API, so please don't discuss efficiency or design here :) Here's the snippet: int32x2_t stuff32 = ...; int16x4_t…
Tramboi
4
votes
4 answers

NEON ASM code running much slower than C code?

I'm trying to implement Gauss-Newton optimization for a specific problem on iPhone ARM using NEON. The first function below is my original C function. The second is the NEON asm code I wrote. I ran each one 100,000 times and the NEON version takes…
paul