Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is a vector-processing instruction set for ARM processors. It's also known as Advanced SIMD (Single Instruction Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in and, where relevant, a tag for how you access the instructions.

More information at

  1. Neon page on the ARM website
  2. Wikipedia article on ARM

885 questions
8
votes
2 answers

Which one is better, gcc or armcc for NEON optimizations?

Referring to @auselen's answer here: Using ARM NEON intrinsics to add alpha and permute, it looks like the armcc compiler is far better than the gcc compiler for NEON optimizations. Is this really true? I haven't really tried the armcc compiler. But I got…
nguns
8
votes
0 answers

Attempt to convert SSE2 Fast Corner score code to ARM Neon

I was trying to port some SSE2 code (fast corner detector score computation) to ARM Neon instructions. The code is quite simple at first sight but the results are different for some reason. The thing is that sometimes the difference may be quite…
inspirit
8
votes
2 answers

Neon equivalent to SSE intrinsics

I'm trying to convert some C code to an optimized version using NEON intrinsics. Here is the C code, which operates on 2 operands rather than on vectors of operands: uint16_t mult_z216(uint16_t a,uint16_t b){ unsigned int c1 = a*b; if(c1) { int…
Kami
8
votes
3 answers

128-bit rotation using ARM Neon intrinsics

I'm trying to optimize my code using Neon intrinsics. I have a 24-bit rotation over a 128-bit array (8 uint16_t elements). Here is my C code: uint16_t rotated[8]; uint16_t temp[8]; uint16_t j; for(j = 0; j < 8; j++) { //Rotation <<< 24 over 128…
Kami
8
votes
2 answers

Android CPU ARM architectures

We have Android code that is CPU-dependent, and I would like to see how many devices used by customers are ARMv6/ARMv7, whether there are still ARMv5 devices, how many ARMv6 devices have VFP, and what the Tegra or Neon percentage is. Any tips where such statistics could be…
STeN
7
votes
3 answers

Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU

Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port: Instruction B I0 I1 M L S F0 F1 .LBB0_1: ldr q3, [x1],…
swineone
7
votes
3 answers

Is there an advantage of specifying "-mfpu=neon-vfpv3" over "-mfpu=neon" for ARMs with separate pipelines?

My Zynq-7000 ARM Cortex-A9 Processor has both the NEON and the VFPv3 extension and the Zynq-7000-TRM says that the processor is configured to have "Independent pipelines for VFPv3 and advanced SIMD instructions". So far I compiled my programs with…
Johannes Schaub - litb
7
votes
4 answers

ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?

I recently discovered the vreinterpret{q}_dsttype_srctype casting operator. However, this doesn't seem to support conversion to the data type described at this link (bottom of the page): Some intrinsics use an array of vector types of the…
Antonio
7
votes
1 answer

Detect ARM NEON availability in the preprocessor?

According to the ARM ARM, __ARM_NEON__ is defined when Neon SIMD instructions are available. I'm having trouble getting GCC to provide it. Neon is available on this BananaPi Pro dev board running Debian 8.2: $ cat /proc/cpuinfo | grep neon Features …
jww
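
For readers landing on this question, a minimal sketch of a compile-time check, assuming a GCC- or Clang-style toolchain. ACLE specifies `__ARM_NEON`, while some 32-bit toolchains have historically defined `__ARM_NEON__`, so the sketch tests both. `HAVE_NEON` and `neon_enabled_at_compile_time` are hypothetical names introduced here for illustration, not part of any library.

```c
/* Test both spellings: ACLE specifies __ARM_NEON, while some 32-bit
   toolchains have historically defined __ARM_NEON__. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#  include <arm_neon.h>
#  define HAVE_NEON 1   /* hypothetical macro name */
#else
#  define HAVE_NEON 0
#endif

/* Hypothetical helper: returns 1 if this translation unit was compiled
   with NEON enabled, 0 otherwise. It says nothing about the CPU the
   binary eventually runs on. */
static int neon_enabled_at_compile_time(void)
{
    return HAVE_NEON;
}
```

On 32-bit ARM with GCC, the macro is typically only defined when NEON is enabled on the command line (for example with an `-mfpu=neon` style option), which may be why it appears to be missing even on NEON-capable hardware.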
7
votes
5 answers

Fast conversion of 16-bit big-endian to little-endian in ARM

I need to convert big arrays of 16-bit integer values from big-endian to little-endian format. Currently I use the following function for the conversion: inline void Reorder16bit(const uint8_t * src, uint8_t * dst) { uint16_t value = *(uint16_t*)src; …
user5480682
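
A hedged sketch of one way this kind of conversion might be vectorized, not the asker's code: the NEON path uses `vrev16q_u8`, which reverses the bytes inside each 16-bit halfword, and a scalar loop handles the tail and serves as the fallback when NEON is not enabled. The function name `swap16_array` and its signature are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#  include <arm_neon.h>
#endif

/* Hypothetical helper: swap the two bytes of every 16-bit element.
   count is the number of 16-bit elements. src and dst may be the same
   buffer, but must not partially overlap. */
static void swap16_array(const uint8_t *src, uint8_t *dst, size_t count)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process 8 elements (16 bytes) per iteration. */
    for (; i + 8 <= count; i += 8) {
        uint8x16_t v = vld1q_u8(src + 2 * i);
        vst1q_u8(dst + 2 * i, vrev16q_u8(v));
    }
#endif
    /* Scalar tail, and the whole job when NEON is not enabled. */
    for (; i < count; ++i) {
        uint8_t lo = src[2 * i];
        uint8_t hi = src[2 * i + 1];
        dst[2 * i] = hi;
        dst[2 * i + 1] = lo;
    }
}
```

Working on bytes rather than casting to `uint16_t *` sidesteps the alignment and aliasing questions raised by the cast in the excerpt.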
7
votes
1 answer

How to convert _mm_shuffle_ps SSE intrinsic to NEON intrinsic?

I am trying to convert code written with SSE to NEON SIMD and got stuck because of the _mm_shuffle_ps SSE intrinsic. Here is the code: b = _mm_shuffle_ps(a, b, 136); a, b, c are all __m128 registers. Now I want to use NEON to implement the same…
CJZ
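
As an illustration of the shuffle in this excerpt: the immediate 136 is 0x88, which selects {a[0], a[2], b[0], b[2]}. The sketch below models that on plain float arrays and, when NEON is enabled, uses `vuzpq_f32`, whose first result vector holds the even-indexed lanes of both inputs. The name `shuffle_136` and the array-based interface are assumptions for illustration, not the accepted answer's code.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#  include <arm_neon.h>
#  define USE_NEON 1   /* hypothetical macro name */
#else
#  define USE_NEON 0
#endif

/* Hypothetical helper modelling _mm_shuffle_ps(a, b, 136):
   out = { a[0], a[2], b[0], b[2] }. out may alias a or b. */
static void shuffle_136(const float a[4], const float b[4], float out[4])
{
#if USE_NEON
    /* De-interleave: val[0] holds the even lanes of a followed by
       the even lanes of b. */
    float32x4x2_t uz = vuzpq_f32(vld1q_f32(a), vld1q_f32(b));
    vst1q_f32(out, uz.val[0]);
#else
    /* Read everything first so that out may alias an input. */
    float r0 = a[0], r1 = a[2], r2 = b[0], r3 = b[2];
    out[0] = r0;
    out[1] = r1;
    out[2] = r2;
    out[3] = r3;
#endif
}
```

Other immediates generally need a different combination of NEON permutes, so this covers only the 136 case from the question.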
7
votes
2 answers

A64 Neon SIMD - 256-bit comparison

I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently. Equality (=) For equality, I already got a solution: bool eq256(const UInt256 *lhs, const UInt256 *rhs) { bool result; First, load the two…
Etan
7
votes
4 answers

Arm NEON and poly8_t and poly16_t

I've been looking into neon optimisation with intrinsics recently and I have come across the poly8_t and poly16_t data types. I'm then left wondering what on earth they are. I've searched all across the net but so far have been unable to find ANY…
Goz
7
votes
5 answers

Fastest way to test a 128 bit NEON register for a value of 0 using intrinsics?

I'm looking for the fastest way to test if a 128-bit NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations, and 2 MOVs: uint32x4_t vr = vorrq_u32(vcmp0, vcmp1); uint64x2_t v0 =…
miluz
7
votes
3 answers

Summing 3 lanes in a NEON float32x4_t

I'm vectorizing an inner loop with ARM NEON intrinsics (llvm, iOS). I'm generally using float32x4_ts. My computation finishes with the need to sum three of the four floats in this vector. I can drop back to C floats at this point and vst1q_f32 to…
Ben Zotto