Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is also known as Advanced SIMD (Single Instruction, Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in, perhaps [assembly], [c] or [c++]. In the latter cases, consider the tags [intrinsics] or [inline-assembly] for how you access the instructions.
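One way to access NEON instructions from C is through compiler intrinsics. The sketch below is only an illustration: it adds two 16-byte arrays with `vld1q_u8`, `vaddq_u8` and `vst1q_u8` from `<arm_neon.h>` when the compiler reports NEON support through the `__ARM_NEON` macro, and falls back to a scalar loop otherwise. The function name `add_16_bytes` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for 16 bytes, wrapping modulo 256. */
void add_16_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a && b && out);
#if defined(__ARM_NEON)
    uint8x16_t va = vld1q_u8(a);      /* load 16 lanes from a */
    uint8x16_t vb = vld1q_u8(b);      /* load 16 lanes from b */
    vst1q_u8(out, vaddq_u8(va, vb));  /* lane-wise add, then store 16 lanes */
#else
    for (int i = 0; i < 16; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);  /* scalar fallback with the same wrapping */
    }
#endif
}
```

The same source builds for AArch32 (with NEON enabled) and AArch64, which is one reason intrinsics are often easier to carry between the two than hand-written assembly.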

More information at

  1. Neon page on the ARM website
  2. Wikipedia article on ARM
885 questions
0
votes
1 answer

ARM NEON my calculation result when there are negative numbers is incorrect

I am trying to calculate the following using NEON in assembly: ((200*(53-255))/255) + 255, whose result should be approximately 97. I've tested it here http://szeged.github.io/nevada/ and also on a dual-core Cortex-A7 ARM CPU tablet, and the result is 243…
bfalz
  • 92
  • 9
0
votes
2 answers

Assembly code for Neon Intrinsic Version

I am new to NEON assembly programming. I have developed a NEON intrinsics version of a video edge detection algorithm, which gave a 2x performance gain. Now I would like to try NEON assembly; I would like to view the assembly code generated by…
Kaliuday
  • 110
  • 8
0
votes
1 answer

ARM Neon VLD1 instruction loading register twice

The following code loads identical data into D16,D17 as well as D18,D19: vld1.16 {d16, d17, d18, d19}, [R1, :128]! I tried splitting the loads out separately, like so: vld1.16 d16, [R1, :64]! This also loaded the data twice into d16…
Possum
  • 13
  • 2
0
votes
1 answer

some doubts regarding cycles of ARM NEON

I wrote some NEON code in assembly, aiming for maximum optimization. Although latency due to register conflicts and the pipeline is reduced, it shows only a 1-cycle difference, i.e. before n.70-0, after n.69-0. Why is it showing that? I didn't…
Sri
  • 119
  • 1
  • 4
0
votes
2 answers

Is NEON extension present at all modern ARM SoCs?

And can we hope it will be supported in all future mobile devices on ARM architecture (including NVIDIA Tegra)?
Zloten
  • 97
  • 2
  • 4
0
votes
0 answers

I don't know why this arm neon inline assembly code doesn't work

void div_tl_128(unsigned char* data_mat, int b, int Matrix_Size) { int k = 0; int count = Matrix_Size >> Bytes_Shift; if(count == 0) return; __asm__ __volatile__( "lsl r1, #4 \n" …
0
votes
3 answers

NEON memcpy , memset and using .c with .s files

I am trying to get familiar with NEON instructions, both assembly and intrinsics. I use gcc 4.8.2 hardfp. I would like to use the NEON memcpy with preload according to:…
Nick
  • 181
  • 2
  • 11
0
votes
0 answers

How do I use ARM NEON intrinsics?

Basically I'm developing for an iPhone and I compile fine on the Mac; however, I want to use NEON intrinsics to accelerate my vector math. I have experience with SSE and AVX, but I have no idea where to get the NEON header with the intrinsics…
ulak blade
  • 2,515
  • 5
  • 37
  • 81
0
votes
1 answer

ARM-NEON: Conditional register swapping based on parameters

I am writing a subroutine in NEON for image processing that does color swapping, i.e., I sequentially load the R, G, B channels from an array and, depending on some configuration, permute some of them. There are at most 6 permutations (RGB)…
Jordi C.
  • 339
  • 2
  • 12
0
votes
1 answer

Neon intrinsics with complex numbers

I have a lot of calculations with complex numbers (usually an array containing a struct consisting of two floats to represent im and re; see below) and want to speed them up with the NEON C intrinsics. It would be awesome if you could give me an…
marcel
  • 25
  • 6
0
votes
1 answer

Compile error using Neon in NaCl

I added the command line setting “-mfpu=neon” so that I could use NEON instructions. But that causes a weird compile error: 1>C:\Misc\nacl_sdk\vs_addin\examples\video_app\hello_world_gles\src\YUVBlock16x8.cpp(158,1): internal compiler error : in…
NaClPM
  • 131
  • 1
  • 4
0
votes
2 answers

Optimize RGBA->RGB arm64 assembly

I wrote this very naive NEON implementation to convert from RGBA to RGB. It works, but I was wondering if there was anything else I could do to further improve performance. I tried playing around with the prefetching size and unrolling the loop a…
Tomas Camin
  • 9,996
  • 2
  • 43
  • 62
0
votes
1 answer

SIMD Registers in ARM processor

My question is about ARM NEON. The first question is about register size: I'd like to know the actual SIMD register size of the Apple A6 and the Cortex-A15. The second question is about SIMD instruction cycles. I assume that a lot of ARM processors' NEON…
Henrik
  • 421
  • 1
  • 4
  • 12
0
votes
1 answer

libjpeg-turbo for Android: how to organize runtime selection of NEON / non-NEON code?

I'm using a libjpeg-turbo port for Android. It's not much different from the base jpeg-turbo in terms of source code: http://git.linaro.org/gitweb?p=people/tomgall/libjpeg-turbo/libjpeg-turbo.git;a=shortlog;h=refs/heads/android There is a module…
Violet Giraffe
  • 32,368
  • 48
  • 194
  • 335
0
votes
2 answers

About vsubq_u16(uint16x8_t, uint16x8_t)

About vsubq_u16(uint16x8_t a, uint16x8_t b): the return value is also uint16x8_t. So if a is smaller than b, we get a very large uint16x8_t value instead of a negative one, which is not what I need. If I have such a requirement, uint16_t c =…
BonderWu
  • 133
  • 1
  • 10
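The wraparound described in the vsubq_u16 excerpt can be reproduced in a few lines. The sketch below is an illustration, not the asker's code or an accepted answer: it contrasts plain `vsubq_u16` (wraps modulo 65536), `vqsubq_u16` (saturates at 0), and reinterpreting the wrapped bits as signed lanes with `vreinterpretq_s16_u16`. The three function names are made up for this example, and a scalar fallback is used when `__ARM_NEON` is not defined.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Wrapping subtraction: (a - b) mod 65536 per lane, as vsubq_u16 computes. */
void sub_wrap_u16x8(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    assert(a && b && out);
#if defined(__ARM_NEON)
    vst1q_u16(out, vsubq_u16(vld1q_u16(a), vld1q_u16(b)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i] - b[i]);
#endif
}

/* Saturating subtraction: lanes where a < b are clamped to 0 (vqsubq_u16). */
void sub_sat_u16x8(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    assert(a && b && out);
#if defined(__ARM_NEON)
    vst1q_u16(out, vqsubq_u16(vld1q_u16(a), vld1q_u16(b)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (a[i] > b[i]) ? (uint16_t)(a[i] - b[i]) : 0;
#endif
}

/* Signed view of the difference. Only exact when the true difference
   fits in int16_t, i.e. lies in [-32768, 32767]. */
void sub_signed_s16x8(const uint16_t *a, const uint16_t *b, int16_t *out)
{
    assert(a && b && out);
#if defined(__ARM_NEON)
    uint16x8_t d = vsubq_u16(vld1q_u16(a), vld1q_u16(b));
    vst1q_s16(out, vreinterpretq_s16_u16(d));  /* same bits, signed lanes */
#else
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)((int32_t)a[i] - (int32_t)b[i]);
#endif
}
```

For a lane with a = 3 and b = 5, the wrapping version yields 65534, the saturating version yields 0, and the signed view yields -2. Which of these is appropriate depends on what the surrounding computation expects; if differences can exceed the int16_t range, a widening subtraction would be needed instead.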