Questions tagged [intrinsics]

Intrinsics are functions used in compiled languages to trigger the execution of specific processor instructions, typically those outside the scope of the compiled language itself.

Intrinsic functions are pseudo-functions that compilers use to expose functionality outside the current scope of the language; such functionality is often incorporated into the language later. Examples include SIMD and atomic instructions. Because the compiler knows what each intrinsic does, it can optimize register use around it.
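
As a rough illustration of a SIMD intrinsic in C, the sketch below adds two float arrays four lanes at a time. It assumes GCC or Clang on x86, where the `__SSE__` macro and `<xmmintrin.h>` are available; `add_floats` is a hypothetical helper name chosen for this example.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Process four floats per iteration; unaligned loads/stores are used
       so the caller does not need 16-byte aligned buffers. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop handles the tail, or everything when SSE is unavailable. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The compiler chooses the registers for `va` and `vb` itself, which is the point the description makes about register allocation.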

A compiler library usually provides actual implementations of these functions, which are used when a less capable (or entirely different) CPU is detected at run time or targeted at compile time.
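
One way to picture the fallback idea is a compile-time switch between a builtin and a plain C implementation. This is only a sketch, assuming GCC or Clang for `__builtin_popcount`; the names `popcount32` and `popcount32_portable` are hypothetical, and a real compiler library may select its fallback differently.

```c
#include <stdint.h>

/* Portable reference implementation: clears the lowest set bit
   on each iteration and counts how many iterations were needed. */
unsigned popcount32_portable(uint32_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;
        ++count;
    }
    return count;
}

/* Uses the compiler builtin when one is known to exist,
   otherwise falls back to the portable version above. */
unsigned popcount32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(x);
#else
    return popcount32_portable(x);
#endif
}
```

When the target CPU has no population-count instruction, the compiler is still free to lower the builtin to generic code or a library call, so the caller sees the same result either way.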

Compiler intrinsics are very similar to inline assembly. Inline assembly needs notation for the permissible input and output registers and for clobbered values, unless the compiler parses the inline assembly itself. With a compiler intrinsic, register use is already built into the compiler, so a developer needs fewer low-level details, although some assembly knowledge still helps to guide profiling and optimization.
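
The contrast can be sketched with a 32-bit byte swap written both ways. This assumes GCC or Clang: `__builtin_bswap32` is their builtin, and the inline-assembly variant uses GCC extended asm on x86 only, falling back to the builtin elsewhere. Both function names are hypothetical.

```c
#include <stdint.h>

/* Intrinsic version: no register constraints to write;
   the compiler picks registers and can optimize around the call. */
uint32_t bswap_intrinsic(uint32_t x)
{
    return __builtin_bswap32(x);
}

/* Inline-assembly version: the "+r" constraint tells the compiler that
   x lives in a general-purpose register that is both read and written. */
uint32_t bswap_inline_asm(uint32_t x)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __asm__("bswap %0" : "+r"(x));
    return x;
#else
    /* Assumption: on non-x86 targets this sketch just reuses the builtin. */
    return __builtin_bswap32(x);
#endif
}
```

In the assembly version the developer has to state the operand constraint by hand, whereas the intrinsic carries that knowledge inside the compiler.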

1314 questions
21
votes
2 answers

Why and when to use __noop?

I was reading about __noop and the MSDN example is #if DEBUG #define PRINT printf_s #else #define PRINT __noop #endif int main() { PRINT("\nhello\n"); } and I don't see the gain over just having an empty macro: #define PRINT The…
Luchian Grigore
  • 253,575
  • 64
  • 457
  • 625
20
votes
1 answer

Undocumented intrinsic routines

Delphi has this list: Delphi Intrinsic Routines But that list is incomplete. What are the 7 undocumented intrinsic functions, since when and what is their purpose?
Johan
  • 74,508
  • 24
  • 191
  • 319
20
votes
3 answers

What's the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands.…
user283145
20
votes
2 answers

How to sum __m256 horizontally?

I would like to horizontally sum the components of a __m256 vector using AVX instructions. In SSE I could use _mm_hadd_ps(xmm,xmm); _mm_hadd_ps(xmm,xmm); to get the result at the first component of the vector, but this does not scale with the 256…
Yoav
  • 5,962
  • 5
  • 39
  • 61
19
votes
6 answers

How to use MSVC intrinsics to get the equivalent of this GCC code?

The following code calls the builtin functions for clz/ctz in GCC and, on other systems, has C versions. Obviously, the C versions are a bit suboptimal if the system has a builtin clz/ctz instruction, like x86 and ARM. #ifdef __GNUC__ #define…
Dark Shikari
  • 7,941
  • 4
  • 26
  • 38
19
votes
1 answer

How to implement "_mm_storeu_epi64" without aliasing problems?

(Note: Although this question is about "store", the "load" case has the same issues and is perfectly symmetric.) The SSE intrinsics provide an _mm_storeu_pd function with the following signature: void _mm_storeu_pd (double *p, __m128d a); So if I…
Nemo
  • 70,042
  • 10
  • 116
  • 153
19
votes
2 answers

How to rotate an SSE/AVX vector

I need to perform a rotate operation with as few clock cycles as possible. In the first case let's assume __m128i as source and dest type: source: || A0 || A1 || A2 || A3 || dest: || A1 || A2 || A3 || A0 || dest =…
user1584773
  • 699
  • 7
  • 19
18
votes
2 answers

Reference manual/tutorial for x86 SIMD intrinsics?

I'm looking into using these to improve the performance of some code but good documentation seems hard to find for the functions defined in the *mmintrin.h headers, can anybody provide me with pointers to good info on these? EDIT: particularly…
BD at Rivenhill
  • 12,395
  • 10
  • 46
  • 49
17
votes
3 answers

Why does SSE set (_mm_set_ps) reverse the order of arguments

I recently noticed that _m128 m = _mm_set_ps(0,1,2,3); puts the 4 floats into reverse order when cast to a float array: (float*) p = (float*)(&m); // p[0] == 3 // p[1] == 2 // p[2] == 1 // p[3] == 0 The same happens with a union { _m128 m;…
Inverse
  • 4,408
  • 2
  • 26
  • 35
17
votes
1 answer

Do I get a performance penalty when mixing SSE integer/float SIMD instructions

I've used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I found frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but in theory should…
user283145
17
votes
5 answers

Intrinsics for CPUID like informations?

Considering that I'm coding in C++, if possible, I would like to use an Intrinsics-like solution to read useful informations about the hardware, my concerns/considerations are: I don't know assembly that well, it will be a considerable investment…
user2485710
  • 9,451
  • 13
  • 58
  • 102
17
votes
5 answers

Is it possible to cast floats directly to __m128 if they are 16 byte aligned?

Is it safe/possible/advisable to cast floats directly to __m128 if they are 16 byte aligned? I noticed using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds a significant overhead. What are potential pitfalls I should be aware of? EDIT…
dtech
  • 47,916
  • 17
  • 112
  • 190
16
votes
1 answer

Divide by floating-point number using NEON intrinsics

I'm processing an image four pixels at a time, on an ARMv7 device for an Android application. I want to divide a float32x4_t vector by another vector, but the numbers in it vary from circa 0.7 to 3.85, and it seems to me that the only way to…
Darkmax
  • 187
  • 1
  • 9
16
votes
1 answer

what's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256

I had been using _mm256_lddqu_si256 based on an example I found online. Later I discovered _mm256_loadu_si256. The Intel Intrinsics guide only states that the lddqu version may perform better when crossing a cache line boundary. What might be the…
Jimbo
  • 2,886
  • 2
  • 29
  • 45
16
votes
0 answers

Costs of new AVX512 instruction - Scatter store

I'm playing around with the new AVX512 instruction sets and I try to understand how they work and how one can use them. What I try is to interleave specific data, selected by a mask. My little benchmark loads x*32 byte of aligned data from memory…
Hymir
  • 811
  • 1
  • 10
  • 20