Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is also known as Advanced SIMD (Single Instruction, Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in, perhaps [assembly], [c] or [c++]. In the latter two cases, consider the tags [intrinsics] or [inline-assembly] to indicate how you access the instructions.
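As a rough illustration of what an [intrinsics] question in [c] typically looks like, here is a minimal sketch. The helper name `add_u8` and the `HAVE_NEON` macro are made up for this example. It assumes the compiler defines `__ARM_NEON` (or the older `__ARM_NEON__`) when NEON is enabled, and it falls back to plain C otherwise, so it should build on non-ARM targets too. The three intrinsics used exist in both the AArch32 and AArch64 flavours of `arm_neon.h`.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the toolchain defines __ARM_NEON (ACLE spelling) or the
 * older __ARM_NEON__ when NEON code generation is enabled. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, local to this example */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] (wrapping modulo 256). */
static void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* Process 16 bytes per iteration in one 128-bit Q register. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vaddq_u8(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

The same source compiles for either architecture, which is why a question like this would still need [arm] or [arm64] to say which instruction encoding and register set the generated code targets.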

885 questions

2 answers

Which one is better, gcc or armcc for NEON optimizations?

Referring to @auselen's answer here: Using ARM NEON intrinsics to add alpha and permute, it looks like the armcc compiler is far better than the gcc compiler for NEON optimizations. Is this really true? I haven't really tried the armcc compiler. But I got…

embedded arm simd neon cortex-a8

asked Sep 25 '12 at 06:49

nguns

0 answers

attempt to convert SSE2 Fast Corner score code to ARM Neon

I was trying to port some SSE2 code (fast corner detector score computation) to ARM Neon instructions. The code is quite simple at first sight but the results are different for some reason. The thing is that sometimes the difference may be quite…

arm sse neon computer-vision

asked Aug 07 '12 at 22:08

inspirit

2 answers

Neon equivalent to SSE intrinsics

I'm trying to convert some C code to an optimized version using NEON intrinsics. Here is the C code, which operates on 2 operands, not on vectors of operands. uint16_t mult_z216(uint16_t a,uint16_t b){ unsigned int c1 = a*b; if(c1) { int…

c arm sse multiplication neon

asked Jul 02 '12 at 11:37

Kami

3 answers

128-bit rotation using ARM Neon intrinsics

I'm trying to optimize my code using Neon intrinsics. I have a 24-bit rotation over a 128-bit array (eight uint16_t values). Here is my C code: uint16_t rotated[8]; uint16_t temp[8]; uint16_t j; for(j = 0; j < 8; j++) { //Rotation <<< 24 over 128…

c rotation intrinsics neon

asked Jun 29 '12 at 09:48

Kami

2 answers

Android CPU ARM architectures

We have Android CPU-dependent code and I would like to see how many devices used by customers are ARMv6/ARMv7, whether there are still ARMv5 devices, how many ARMv6 devices have VFP, and what the Tegra or Neon percentage is. Any tips where such statistics could be…

android arm neon tegra

asked Jun 06 '12 at 19:20

STeN

3 answers

Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU

Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port: Instruction B I0 I1 M L S F0 F1 .LBB0_1: ldr q3, [x1],…

performance assembly optimization arm neon

asked Nov 05 '21 at 15:31

swineone

3 answers

Is there an advantage of specifying "-mfpu=neon-vfpv3" over "-mfpu=neon" for ARMs with separate pipelines?

My Zynq-7000 ARM Cortex-A9 Processor has both the NEON and the VFPv3 extension and the Zynq-7000-TRM says that the processor is configured to have "Independent pipelines for VFPv3 and advanced SIMD instructions". So far I compiled my programs with…

gcc assembly arm neon armv7

asked Dec 12 '17 at 08:54

Johannes Schaub - litb

4 answers

ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?

I recently found out about the vreinterpret{q}_dsttype_srctype casting operator. However, this doesn't seem to support conversion in the data type described at this link (bottom of the page): Some intrinsics use an array of vector types of the…

c++ c arm vectorization neon

asked Apr 20 '17 at 13:38

Antonio

1 answer

Detect ARM NEON availability in the preprocessor?

According to the ARM ARM, __ARM_NEON__ is defined when Neon SIMD instructions are available. I'm having trouble getting GCC to provide it. Neon is available on this BananaPi Pro dev board running Debian 8.2: $ cat /proc/cpuinfo | grep neon Features …

gcc macros arm c-preprocessor neon

asked May 05 '16 at 12:23

jww

5 answers

Fast conversion of 16-bit big-endian to little-endian in ARM

I need to convert big arrays of 16-bit integer values from big-endian to little-endian format. Currently I use the following function for the conversion: inline void Reorder16bit(const uint8_t * src, uint8_t * dst) { uint16_t value = *(uint16_t*)src; …

c++ arm simd neon

asked Nov 26 '15 at 06:36

user5480682

1 answer

How to convert _mm_shuffle_ps SSE intrinsic to NEON intrinsic?

I am trying to convert code written with SSE to NEON SIMD and got stuck on the _mm_shuffle_ps SSE intrinsic. Here is the code: b = _mm_shuffle_ps(a, b, 136); a, b, c are all __m128 registers. Now I want to use NEON to implement the same…

arm sse simd neon

asked Sep 12 '15 at 07:16

CJZ

2 answers

A64 Neon SIMD - 256-bit comparison

I would like to efficiently compare two little-endian 256-bit values with A64 Neon instructions (asm). Equality (=) For equality, I already have a solution: bool eq256(const UInt256 *lhs, const UInt256 *rhs) { bool result; First, load the two…

arm comparison simd neon arm64

asked Apr 20 '15 at 08:34

Etan

4 answers

Arm NEON and poly8_t and poly16_t

I've been looking into neon optimisation with intrinsics recently and I have come across the poly8_t and poly16_t data types. I'm then left wondering what on earth they are. I've searched all across the net but so far have been unable to find ANY…

c++ c arm neon intrinsics

asked Mar 06 '14 at 12:17

Goz

5 answers

Fastest way to test a 128 bit NEON register for a value of 0 using intrinsics?

I'm looking for the fastest way to test if a 128-bit NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations, and 2 MOVs: uint32x4_t vr = vorrq_u32(vcmp0, vcmp1); uint64x2_t v0 =…

neon

asked Mar 13 '13 at 15:29

miluz

3 answers

Summing 3 lanes in a NEON float32x4_t

I'm vectorizing an inner loop with ARM NEON intrinsics (llvm, iOS). I'm generally using float32x4_ts. My computation finishes with the need to sum three of the four floats in this vector. I can drop back to C floats at this point and vst1q_f32 to…

ios arm simd neon intrinsics

asked Dec 14 '12 at 00:50

Ben Zotto

