Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is also known as Advanced SIMD (Single Instruction, Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32 or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in, perhaps [assembly], [c] or [c++]. In the latter cases, consider the tags [intrinsics] or [inline-assembly] for how you access the instructions.

885 questions

1 answer

ARM NEON my calculation result when there are negative numbers is incorrect

I am trying to calculate the following using NEON in assembly: ((200*(53-255))/255) + 255, whose result should be approximately 97. I've tested it here http://szeged.github.io/nevada/ and also on a dual-core Cortex-A7 ARM CPU tablet, and the result is 243…

assembly arm neon alphablending

asked Jul 27 '14 at 01:12

bfalz
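
As a scalar point of comparison for the expression quoted in the question above, the following is a minimal sketch that evaluates ((200*(53-255))/255) + 255 in signed 32-bit C arithmetic. The function name blend_reference and its parameter names are hypothetical, chosen only for illustration; the asker's NEON code is truncated in the excerpt and is not reproduced here. The intermediate product is negative and falls outside the signed 16-bit range, so a vector version would presumably need signed lanes of at least 32 bits at that step to match this reference.

```c
#include <stdint.h>

/* Scalar reference for ((a * (b - c)) / c) + c in signed 32-bit arithmetic.
   The name and parameters are illustrative, not taken from the question. */
int32_t blend_reference(int32_t a, int32_t b, int32_t c)
{
    int32_t diff = b - c;     /* 53 - 255 = -202 */
    int32_t prod = a * diff;  /* 200 * -202 = -40400, outside the int16_t range */
    int32_t quot = prod / c;  /* C integer division truncates toward zero: -158 */
    return quot + c;          /* -158 + 255 = 97 */
}
```

Calling blend_reference(200, 53, 255) gives 97, which matches the value the asker expects.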

2 answers

Assembly code for Neon Intrinsic Version

I am new to NEON assembly programming. I have developed a NEON intrinsics version of a video edge detection algorithm, which resulted in a 2x performance gain. Now I would like to try NEON assembly. I would like to view the assembly code generated by…

android assembly android-ndk neon

asked Jun 30 '14 at 10:14

Kaliuday

1 answer

ARM Neon VLD1 instruction loading register twice

The following code loads identical data into D16, D17 as well as D18, D19: vld1.16 {d16, d17, d18, d19}, [R1, :128]! I tried splitting the loads out separately, like so: vld1.16 d16, [R1, :64]! This also loaded the data twice into d16…

xcode assembly arm lldb neon

asked Jun 22 '14 at 22:37

Possum

1 answer

Some doubts regarding cycles of ARM NEON

I wrote some NEON code in assembly, aiming for maximum optimization. Although latency due to register conflicts and the pipeline is reduced, it shows only a 1-cycle difference, i.e. before n.70-0, after n.69-0. Why is it showing that? I didn't…

arm inline-assembly simd neon cortex-a8

asked Jun 13 '14 at 06:33

Sri

2 answers

Is the NEON extension present on all modern ARM SoCs?

And can we hope it will be supported in all future mobile devices on ARM architecture (including NVIDIA Tegra)?

arm neon

asked May 07 '14 at 11:48

Zloten

0 answers

I don't know why this arm neon inline assembly code doesn't work

void div_tl_128(unsigned char* data_mat, int b, int Matrix_Size) { int k = 0; int count = Matrix_Size >> Bytes_Shift; if(count == 0) return; __asm__ __volatile__( "lsl r1, #4 \n" …

c assembly arm inline-assembly neon

asked Apr 30 '14 at 12:41

user3589306

3 answers

NEON memcpy, memset and using .c with .s files

I am trying to get familiar with NEON instructions, both assembly and intrinsics. I use gcc V4.8.2 hardfp. I would like to use the NEON memcpy with preload according to:…

assembly arm neon

asked Apr 24 '14 at 10:06

Nick

0 answers

How do I use ARM NEON intrinsics?

Basically I'm developing for an iPhone and I compile fine on the Mac; however, I want to use NEON intrinsics to accelerate my vector math. I have experience with SSE and AVX, but I have no idea where to get the NEON header with the intrinsics…

c++ xcode arm neon

asked Mar 18 '14 at 20:13

ulak blade

1 answer

ARM-NEON: Conditional register swapping based on parameters

I am writing a NEON subroutine for image processing that does color swapping, i.e., I sequentially load the R, G, B channels from an array and, depending on some configuration, permute some of them. There are at most 6 permutations (RGB)…

performance optimization assembly arm neon

asked Mar 07 '14 at 19:45

Jordi C.

1 answer

Neon intrinsics with complex numbers

I have a lot of calculations with complex numbers (usually an array containing a struct consisting of two floats to represent im and re; see below) and want to speed them up with the NEON C intrinsics. It would be awesome if you could give me an…

c gcc arm neon

asked Feb 18 '14 at 23:16

marcel

1 answer

Compile error using Neon in NaCl

I added the command line setting “-mfpu=neon” so that I could use NEON instructions. But that causes a weird compile error: 1>C:\Misc\nacl_sdk\vs_addin\examples\video_app\hello_world_gles\src\YUVBlock16x8.cpp(158,1): internal compiler error : in…

neon google-nativeclient

asked Jan 28 '14 at 18:34

NaClPM

2 answers

Optimize RGBA->RGB arm64 assembly

I wrote this very naive NEON implementation to convert from RGBA to RGB. It works, but I was wondering if there was anything else I could do to further improve performance. I tried playing around with the prefetching size and unrolling the loop a…

iphone assembly arm neon arm64

asked Dec 18 '13 at 16:06

Tomas Camin

1 answer

SIMD Registers in ARM processor

My question is about ARM NEON. The first question is about the register size. I'd like to know the actual SIMD register size of the Apple A6 and the Cortex-A15. The second question is about the SIMD instructions' cycle count. I assume that a lot of ARM processors' NEON…

neon

asked Dec 18 '13 at 00:56

Henrik

1 answer

libjpeg-turbo for Android: how to organize runtime selection of NEON / non-NEON code?

I'm using a libjpeg-turbo port for Android. It's not much different from the base jpeg-turbo in terms of source code: http://git.linaro.org/gitweb?p=people/tomgall/libjpeg-turbo/libjpeg-turbo.git;a=shortlog;h=refs/heads/android There is a module…

android android-ndk neon libjpeg libjpeg-turbo

asked Dec 12 '13 at 14:09

Violet Giraffe

2 answers

About vsubq_u16(uint16x8_t, uint16x8_t)

About vsubq_u16(uint16x8_t a, uint16x8_t b): the return value is also uint16x8_t. If a is smaller than b, we get a very large uint16x8_t value instead of a negative one, which is not what I need. If I have such a requirement, uint16_t c =…

arm neon intrinsics

asked Dec 12 '13 at 03:29

BonderWu
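
The wraparound described in the vsubq_u16 question above can be modelled one lane at a time in plain C. This is a scalar sketch only, not NEON code, and the three function names are hypothetical. It contrasts the modulo-2^16 result of an unsigned 16-bit subtract with two common alternatives: a widened signed difference and a subtract that clamps at zero. NEON has a saturating counterpart (vqsubq_u16) and an absolute-difference intrinsic (vabdq_u16). Which behaviour the asker actually needs is not visible in the truncated excerpt.

```c
#include <stdint.h>

/* One lane of an unsigned 16-bit subtract: the result wraps modulo 2^16. */
uint16_t sub_wrap_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a - b);
}

/* Widen both operands first so the difference can be negative. */
int32_t sub_signed_wide(uint16_t a, uint16_t b)
{
    return (int32_t)a - (int32_t)b;
}

/* Saturating unsigned subtract: clamp at 0 instead of wrapping. */
uint16_t sub_saturate_u16(uint16_t a, uint16_t b)
{
    return (a > b) ? (uint16_t)(a - b) : (uint16_t)0;
}
```

For a = 3 and b = 5, sub_wrap_u16 gives 65534, sub_signed_wide gives -2, and sub_saturate_u16 gives 0.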
