Questions tagged [intrinsics]

Intrinsics are functions used in compiled languages to trigger the execution of specific processor instructions, typically those outside the scope of the compiled language itself.

Intrinsic functions are pseudo-functions that compilers use to represent functionality outside the current scope of the language; they may later be incorporated into the language itself. Examples include SIMD and atomic instructions. The compiler knows what each intrinsic does and can optimize register use to take advantage of it.

A compiler's runtime library usually provides actual implementations of these functions, which are used when a less capable (or entirely different) CPU is detected at compile time or run time.
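The fallback idea can be sketched with a population count. This is a minimal illustration, not part of the tag description: the function name `count_bits` is hypothetical, and the example assumes GCC or Clang for the `__builtin_popcount` path. On those compilers the builtin typically becomes a single `popcnt` instruction when the target supports it (for example with `-mpopcnt`), and otherwise falls back to generic code or a runtime-library helper.

```c
#include <stdint.h>

/* Hypothetical helper: count the set bits in a 32-bit value. */
static unsigned count_bits(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Compiler builtin: the compiler chooses between a hardware
       instruction and a software fallback based on the target. */
    return (unsigned)__builtin_popcount(x);
#else
    /* Portable fallback for other compilers: clear the lowest
       set bit on each iteration and count the iterations. */
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
#endif
}
```

Calling `count_bits(0xFFu)` yields 8 on either path; only the generated machine code differs.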

Compiler intrinsics are very similar to inline assembly. Inline assembly needs notation to declare the permissible input and output registers and any clobbered values, unless the compiler parses the inline assembly itself. With a compiler intrinsic, register use is already built into the compiler, so a developer doesn't need to know as many low-level details, although some low-level assembly knowledge is often helpful for guiding profiling and optimization.
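As a rough sketch of what this looks like in practice, the following adds four 32-bit integers with SSE2 intrinsics. The function name `add4_u32` is hypothetical, and the `__SSE2__` check assumes a GCC- or Clang-style predefined macro; on other targets the scalar loop is used instead. Note that no registers or clobbers are named anywhere: the compiler allocates them.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3,
   with wraparound on overflow in both code paths. */
static void add4_u32(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
#if defined(__SSE2__)
    /* Unaligned loads, one packed 32-bit add, unaligned store.
       The compiler picks the XMM registers itself. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Scalar fallback when SSE2 is not available at compile time. */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

The equivalent inline assembly would need explicit input, output, and clobber constraints for each operand.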

1314 questions
12
votes
2 answers

Does Clang have something like #pragma GCC target?

I have some code written that uses AVX intrinsics when they are available on the current CPU. In GCC and Clang, unlike Visual C++, in order to use intrinsics, you must enable them on the command line. The problem with GCC and Clang is that when you…
Myria
  • 3,372
  • 1
  • 24
  • 42
12
votes
1 answer

Compile C++ code with AVX2/AVX512 intrinsics on AVX

I have production code that has kernels implemented for various SIMD instruction sets, including AVX, AVX2, and AVX512. The code can be compiled on the target machine for the target machine with something like ./configure --enable-proc=AVX…
Martin Ueding
  • 8,245
  • 6
  • 46
  • 92
12
votes
1 answer

is there an inverse instruction to the movemask instruction in intel avx2?

The movemask instruction(s) take an __m256i and return an int32 where each bit (either the first 4, 8 or all 32 bits depending on the input vector element type) is the most significant bit of the corresponding vector element. I would like to do the…
orm
  • 2,835
  • 2
  • 22
  • 35
12
votes
1 answer

How does _mm_mwait work?

How does _mm_mwait from pmmintrin.h work? (I mean not the asm for it, but action and how this action is taken in NUMA systems. The store monitoring is easy to implement only on bus-based SMP systems with snooping of bus.) What processors does…
osgx
  • 90,338
  • 53
  • 357
  • 513
12
votes
3 answers

Emulating shifts on 32 bytes with AVX

I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics. Much to my disappointment, I discover that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate only on the two halves of the AVX registers separately…
user1196549
12
votes
3 answers

Initializing an __m128 type from a 64-bit unsigned int

The _mm_set_epi64 and similar *_epi64 instructions seem to use and depend on __m64 types. I want to initialize a variable of type __m128 such that the upper 64 bits of it are 0, and the lower 64 bits of it are set to x, where x is of type uint64_t…
Gideon
  • 433
  • 4
  • 15
12
votes
3 answers

Using SSE instructions with gcc without inline assembly

I am interested in using the SSE vector instructions of x86-64 with gcc and don't want to use any inline assembly for that. Is there a way I can do that in C? If so, can someone give me an example?
pythonic
  • 20,589
  • 43
  • 136
  • 219
11
votes
1 answer

Fallback implementation for conflict detection in AVX2

AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a), which returns a vector where, for every element in a, a bit is set if it has the same value. Is there a way to do something similar in AVX2? I'm not interested in the exact bits I just…
Christoph Diegelmann
  • 2,004
  • 15
  • 26
11
votes
4 answers

Most efficient way to store 4 dot products into a contiguous array in C using SSE intrinsics

I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics. A portion of my program computes 4 dot products and adds each result to the previous values in a contiguous chunk of an array. More specifically, tmp0 =…
Sam
  • 417
  • 1
  • 6
  • 13
11
votes
1 answer

How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?

I don't have a particular use-case in mind; I'm asking if this is really a design flaw / limitation in Intel's intrinsics or if I'm just missing something. If you want to combine a scalar float with an existing vector, there doesn't seem to be a way…
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
11
votes
4 answers

How do I reorder vector data using ARM Neon intrinsics?

This is specifically related to ARM Neon SIMD coding. I am using ARM Neon intrinsics for a certain module in a video decoder. I have vectorized data as follows: There are four 32-bit elements in a Neon register - say, Q0 - which is 128 bits in size.…
goldenmean
  • 18,376
  • 54
  • 154
  • 211
11
votes
2 answers

Fast calculate hamming distance in C

I read the Wikipedia article on Hamming Weight and noticed something interesting: It is thus equivalent to the Hamming distance from the all-zero string of the same length. For the most typical case, a string of bits, this is the number of 1's in…
haneefmubarak
  • 1,911
  • 1
  • 21
  • 32
11
votes
1 answer

Vectorizing Modular Arithmetic

I'm trying to write some reasonably fast component-wise vector addition code. I'm working with (signed, I believe) 64-bit integers. The function is void addRq (int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) { for(int i = 0; i…
crockeea
  • 21,651
  • 10
  • 48
  • 101
11
votes
1 answer

How to load a pixel struct into an SSE register?

I have a struct of 8-bit pixel data: struct __attribute__((aligned(4))) pixels { char r; char g; char b; char a; } I want to use SSE instructions to calculate certain things on these pixels (namely, a Paeth transformation). How can…
fuz
  • 88,405
  • 25
  • 200
  • 352
11
votes
1 answer

What's the difference between __popcnt() and _mm_popcnt_u32()?

MS Visual C++ supports 2 flavors of the popcnt instruction on CPUs with SSE4.2: __popcnt() _mm_popcnt_u32() The only difference I found was that the docs for __popcnt() are marked as "Microsoft Specific", and _mm_popcnt_u32() seems to be an…
Adi Shavit
  • 16,743
  • 5
  • 67
  • 137