I'm using FASM (assembly) and I'm looking for the SSE2 assembly-instruction equivalents of these intrinsics:
_mm_set1_epi8
_mm_cmpeq_epi8
_mm_movemask_epi8
Where can I find them (web site, PDF, ...)?
Use the Intel Intrinsics Guide, but note that some intrinsics do not map to a single instruction, e.g. _mm_set1_epi8. For most intrinsics, though, the description lists the corresponding machine instruction.
You can also use the insanely useful Compiler Explorer to see the generated code for given intrinsics, e.g. this example for _mm_set1_epi8.
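For reference, this is a sketch of the kind of SSE2-only sequence a compiler typically emits for _mm_set1_epi8; register choices here are arbitrary, and with SSSE3 a pxor + pshufb pair does the same job in fewer uops:

        movd       xmm0, eax        ; byte to broadcast is in al (only the low byte ends up mattering)
        punpcklbw  xmm0, xmm0       ; interleave with itself: low word = (byte, byte)
        pshuflw    xmm0, xmm0, 0    ; broadcast that word across the low 4 words
        punpcklqdq xmm0, xmm0       ; duplicate the low qword into the high half -> 16 copies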
Instead of messing around with the intrinsics documentation, look at Intel's asm documentation in the first place: their x86 Software Developer Manual vol. 2, or HTML extracts of just the instruction entries (without the intro and appendices) at https://www.felixcloutier.com/x86/index.html, e.g. https://www.felixcloutier.com/x86/PCMPEQB:PCMPEQW:PCMPEQD.html
(Intel's asm manual entries list the intrinsics for each instruction at the bottom of the entry. Those lists are a cluttered mess now that AVX-512 is part of the main PDF, but you can still check in the other direction: if you've already guessed which instruction an intrinsic should map to, look that instruction up and confirm. Or, if you search the full PDF version, you'll get a hit on the intrinsic name for intrinsics that map directly to one instruction, like _mm_cmpeq_epi8, but not for set1.)
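To tie this back to the question: _mm_cmpeq_epi8 and _mm_movemask_epi8 each map directly to one instruction, pcmpeqb and pmovmskb. A minimal compare-and-test sketch, assuming the search byte has already been broadcast into xmm1 (register choices and the found_match label are placeholders):

        movdqu   xmm0, [rsi]        ; unaligned 16-byte load (_mm_loadu_si128)
        pcmpeqb  xmm0, xmm1         ; _mm_cmpeq_epi8: 0xFF in each matching byte lane
        pmovmskb eax, xmm0          ; _mm_movemask_epi8: one bit per byte lane
        test     eax, eax
        jnz      found_match        ; bsf eax, eax then gives the offset of the first match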
It's better / more detailed than their intrinsics documentation (e.g. the Operation section always exists, and is usually more specific). Plus, it shows you what order the operands go in. This usually matches the intrinsic, but I seem to remember a case where it didn't, maybe with a shuffle. And of course there's vfmadd132ps vs. vfmadd213ps vs. vfmadd231ps, which differ in which operand (the addend or one of the multiplicands) is the destination and which one can be memory.
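For example (FMA3 with AVX encoding; register numbers are arbitrary), the digits say where each operand appears in the dest = a*b + c formula, and operand 3 is the one that can be a memory operand:

        vfmadd132ps xmm1, xmm2, xmm3   ; xmm1 = xmm1*xmm3 + xmm2
        vfmadd213ps xmm1, xmm2, xmm3   ; xmm1 = xmm2*xmm1 + xmm3
        vfmadd231ps xmm1, xmm2, xmm3   ; xmm1 = xmm2*xmm3 + xmm1  (accumulate into the destination)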
It also shows you which operand can be memory. It's not always the last one, e.g. VBLENDVPS xmm1, xmm2, xmm3/m128, xmm4 (because the last operand is encoded in an immediate byte, instead of being implicitly xmm0 like in the non-VEX version). Also, pmovzxbd xmm1, dword [rdi] and others are useful as a narrow load (which doesn't require alignment because it's less than 16 bytes), but you'd never know that from intrinsics that only provide it with a __m128i source. Compilers can't always fold the load into a memory operand after you use _mm_cvtsi32_si128 (int a).
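A sketch of the difference (pmovzxbd is SSE4.1; the pointer register is arbitrary), with the narrow-load form next to the in-register route the intrinsics push you toward:

        pmovzxbd xmm0, dword [rdi]    ; 4 bytes from memory zero-extended to 4 dwords, no alignment needed
        ; vs. the intrinsic route: scalar load, then widen in registers,
        ; hoping the compiler folds the movd back into pmovzxbd's memory operand
        movd     xmm0, dword [rdi]    ; _mm_cvtsi32_si128
        pmovzxbd xmm0, xmm0           ; _mm_cvtepu8_epi32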
And there's pblendvb, where the non-VEX form is PBLENDVB xmm1, xmm2/m128, <XMM0>, implicitly using XMM0 for the blend-control vector. Intrinsics hide this as well, so you'd get confusing errors if you tried to write pblendvb xmm1, xmm8, xmm7 in asm.
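A sketch of the two forms (pblendvb is SSE4.1, vpblendvb needs AVX; register numbers chosen arbitrarily):

        ; legacy-SSE form: the control vector is hard-wired to xmm0
        movdqa    xmm0, xmm5             ; blend mask has to be moved into xmm0 first
        pblendvb  xmm1, xmm2             ; bytes from xmm2 where xmm0's sign bit is set, else keep xmm1
        ; VEX form: all four operands are explicit, any registers work
        vpblendvb xmm1, xmm3, xmm4, xmm5 ; xmm1 = bytes from xmm4 where xmm5's sign bit is set, else from xmm3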
Agner Fog's asm optimization guide also has a chapter on SIMD, with some pretty good tables of data-movement instructions that are useful for different kinds of tasks.
See also the SO x86 tag wiki for more links.
I find the asm mnemonics easier to remember; they're shorter and have slightly fewer weird differences like shuffle vs. permute in the naming (most of the time, until AVX...). More importantly, I tend to think in terms of asm and then write intrinsics that will let the compiler compile efficiently.
CPU latency/throughput/execution-port information is all organized by mnemonic, not by intrinsic (Agner Fog's tables, instlatx64, and http://uops.info/), so you have to know those names to get into really low-level performance details, to check whether the compiler did a good job with your code, and to make sense of perf record / perf report profiling results when you're trying to figure out why there's a hot spot somewhere.
Intel has throughput/latency numbers in their intrinsics guide, but not execution ports, so you can't tell whether two throughput=1 instructions can run in the same cycle as each other, which makes it not very useful for that.