Questions tagged [neon]

NEON is a vector-processing instruction set for ARM processors. Please use this tag together with [arm] if asking about the AArch32 version of NEON (to run on 32-bit ARM processors), or [arm64] for AArch64. See also the [simd] tag.

NEON is a vector-processing instruction set for ARM processors. It's also known as Advanced SIMD (Single Instruction Multiple Data).

NEON can be used on either 32-bit or 64-bit ARM processors, as part of the AArch32 or AArch64 architectures respectively. However, there are significant differences between the AArch32 and AArch64 versions of NEON (register usage, instruction mnemonics, instruction availability), so please use this tag together with either [arm] for AArch32, or [arm64] for AArch64.

The [simd] tag may also be appropriate, especially for questions about SIMD algorithms that may be implemented with NEON.

Don't forget to include a tag for the programming language you are coding in. If you are not writing assembly directly, also consider a tag for how you access the instructions.
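As a rough illustration of what accessing NEON from C can look like, the sketch below adds two float arrays four lanes at a time using intrinsics from <arm_neon.h>. The helper name add_f32 is hypothetical, and the scalar fallback is an assumption added so the same file also builds on targets without NEON; the intrinsics used here (vld1q_f32, vaddq_f32, vst1q_f32) are available on both AArch32 and AArch64.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for every i in [0, n). */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && a != NULL && b != NULL);

#if HAVE_NEON
    /* Four 32-bit floats fit in one 128-bit vector register,
       so each iteration handles four elements at once. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load a[i..i+3]  */
        float32x4_t vb = vld1q_f32(b + i);      /* load b[i..i+3]  */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* lane-wise add   */
    }
#endif

    /* Scalar tail; this is the whole loop when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling add_f32(out, a, b, 6) with six-element arrays would take one vector iteration plus two scalar iterations on a NEON build. A question about code like this would typically carry the language tag, an intrinsics-related tag, and [arm] or [arm64] depending on the target.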

More information at

  1. NEON page on the ARM website
  2. Wikipedia article on ARM
885 questions
10
votes
1 answer

SIMD optimization of cvtColor using ARM NEON intrinsics

I'm working on a SIMD optimization of BGR to grayscale conversion which is equivalent to OpenCV's cvtColor() function. There is an Intel SSE version of this function and I'm referring to it. (What I'm doing is basically converting SSE code to NEON…
S.Sato
  • 103
  • 1
  • 7
10
votes
4 answers

iPhone detecting processor model / NEON support

I'm looking for a way to differentiate at runtime between devices equipped with the new ARM processor (such as iPhone 3GS and some iPods 3G) and devices equipped with the old ARM processors. I know I can use uname() to determine the device model,…
yonilevy
  • 5,320
  • 3
  • 31
  • 27
10
votes
6 answers

SSE _mm_movemask_epi8 equivalent method for ARM NEON

I decided to continue my FAST corners optimisation and got stuck at the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with uint8x16_t input?
inspirit
  • 329
  • 3
  • 7
9
votes
1 answer

How to stop GCC from breaking my NEON intrinsics?

I need to write optimized NEON code for a project and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON intrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM…
BitBank
  • 8,500
  • 3
  • 28
  • 46
9
votes
1 answer

How to initialize const float32x4x4_t (ARM NEON intrinsic, GCC)?

I can initialize float32x4_t like this: const float32x4x4_t zero = { 0.0f, 0.0f, 0.0f, 0.0f }; But this code produces the error "Incompatible types in initializer": const float32x4x4_t one = { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, …
eonil
  • 83,476
  • 81
  • 317
  • 516
9
votes
1 answer

How do Android programs make use of NEON SIMD?

I've been reading up a little on CPU features and stumbled upon NEON. From what I've read, it looks like NEON requires specific programming to use it, but is this completely true, or do the CPUs that have this feature still find ways to…
Tam
  • 1,189
  • 10
  • 15
8
votes
5 answers

Optimizing RGBA8888 to RGB565 conversion with NEON

I'm trying to optimize an image format conversion on iOS using the NEON vector instruction set. I assumed this would map well to that because it processes a bunch of similar data. My attempts haven't gone that well, though, achieving only a marginal…
Andrew Pouliot
  • 5,423
  • 1
  • 30
  • 34
8
votes
1 answer

sse/avx equivalent for neon vuzp

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: inputs: (A0 A1 A2 A3) (B0 B1 B2 B3) unpacklo/hi:…
Ralf
  • 1,203
  • 1
  • 11
  • 20
8
votes
3 answers

128bit hash comparison with SSE

In my current project, I have to compare 128bit values (actually md5 hashes) and I thought it would be possible to accelerate the comparison by using SSE instructions. My problem is that I can't manage to find good documentation on SSE…
fokenrute
  • 739
  • 6
  • 17
8
votes
1 answer

ARM Cortex-A8: How to make use of both NEON and vfpv3

I'm using a Cortex-A8 processor and I don't understand how to use the -mfpu flag. On the Cortex-A8 there are both vfpv3 and NEON co-processors. Previously I didn't know how to use NEON, so I was only using gcc -marm -mfloat-abi=softfp…
HaggarTheHorrible
  • 7,083
  • 20
  • 70
  • 81
8
votes
2 answers

Optimizing horizontal boolean reduction in ARM NEON

I'm experimenting with a cross-platform SIMD library à la ecmascript_simd, aka SIMD.js, and part of this is providing a few "horizontal" SIMD operations. In particular, the API that library offers includes any() -> bool and all()…
huon
  • 94,605
  • 21
  • 231
  • 225
8
votes
2 answers

(opencv rc1) What causes Mat multiplication to be 20x slower than per-pixel multiplication?

// 700 ms cv::Mat in(height,width,CV_8UC1); in /= 4; Replaced with //40 ms cv::Mat in(height,width,CV_8UC1); for (int y=0; y < in.rows; ++y) { unsigned char* ptr = in.data + y*in.step1(); for (int x=0; x < in.cols; ++x) { ptr[x]…
Boyko Perfanov
  • 3,007
  • 18
  • 34
8
votes
3 answers

Compacting data in buffer from 16 bit per element to 12 bits

I'm wondering if there is any chance to improve performance of such compacting. The idea is to saturate values higher than 4095 and place each value every 12 bits in new continuous buffer. Just like that: Concept: Convert: Input buffer:…
Piotr Nowak
  • 125
  • 1
  • 1
  • 7
8
votes
1 answer

NEON intrinsic types work in C but throw invalid arguments error in C++

I have problems with using NEON intrinsics and inline assembly in Android NDK. NEON types like float32x4_t give an "invalid arguments" error when compiling C++ code with GCC 4.6 and 4.8, however, the code compiles fine if compiled as C. For example,…
Triang3l
  • 1,230
  • 9
  • 29
8
votes
1 answer

Maximum optimization of element wise multiplication via ARM NEON assembly

I'm optimizing an element-wise multiplication of two single-dimensional arrays for a dual Cortex-A9 processor. Linux is running on the board and I'm using the GCC 4.5.2 compiler. The following is my C++ inline assembler function. src1, src2 and…
HyraxK
  • 89
  • 3