
Assume I want to use an instruction that may not be available, and it's not one of those with a transparent fallback: it's an undefined instruction when unsupported. Say it's popcnt, for example.

Can I, instead of using cpuid, just try to execute it?

If it fails, I'll catch the exception, save that information in a bool variable, and use a different branch from then on.

Sure, there would be a performance penalty, but only once. Are there any additional disadvantages to this approach?
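For concreteness, here is a minimal sketch of the idea on x64 MSVC (illustrative only; the `probe_popcnt` / `g_have_popcnt` names are made up):

```cpp
// Hypothetical sketch: probe popcnt once via SEH (MSVC, x64).
#include <windows.h>
#include <intrin.h>

static bool g_have_popcnt = false;

void probe_popcnt() {
    __try {
        // __popcnt compiles to the POPCNT instruction; on CPUs without it,
        // executing it raises #UD, which SEH reports as
        // EXCEPTION_ILLEGAL_INSTRUCTION. volatile discourages the compiler
        // from folding the intrinsic away at compile time.
        volatile unsigned r = __popcnt(0x12345678u);
        (void)r;
        g_have_popcnt = true;
    } __except (GetExceptionCode() == EXCEPTION_ILLEGAL_INSTRUCTION
                    ? EXCEPTION_EXECUTE_HANDLER
                    : EXCEPTION_CONTINUE_SEARCH) {
        g_have_popcnt = false;   // take the portable branch from now on
    }
}
```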

Alex Guteniev

1 Answer


One major difficulty is giving correct execution for that first call.

Once you solve that (by figuring out which instruction faulted, emulating it, and modifying the saved task state), the next problem is the performance of a loop containing popcnt that runs a million iterations after you optimistically dispatched to the popcnt version of that loop.

If your whole program were written in asm (or compilers could generate this code for you), it's maybe plausible, but it's hard for a signal handler to collect all the necessary state and resume execution in the other version of such a loop.

(GNU/Linux signal handlers get non-standard access to the saved register state of the thread they're running in, via the `ucontext_t*` third argument of an `SA_SIGINFO` handler, so you could in theory do this there.)
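For illustration, a hedged sketch of what that access looks like on x86-64 glibc (the `fallback_entry` target is hypothetical):

```cpp
// Sketch: SIGILL handler that redirects the saved RIP (x86-64 glibc;
// _GNU_SOURCE is needed for REG_RIP).
#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>
#include <cstdint>

extern "C" void fallback_entry();   // hypothetical non-popcnt code path

static void on_sigill(int, siginfo_t* si, void* uc_void) {
    (void)si;  // si->si_addr points at the faulting instruction
    ucontext_t* uc = static_cast<ucontext_t*>(uc_void);
    // Rewriting the saved RIP makes the thread resume somewhere else
    // after the handler returns.
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)&fallback_entry;
}

void install_handler() {
    struct sigaction sa = {};
    sa.sa_sigaction = on_sigill;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGILL, &sa, nullptr);
}
```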

Presumably this is only relevant for ahead-of-time compilation; if you're JITing you should just check CPUID ahead of time instead of building exception-handling paths.


Being able to dispatch efficiently means your code is probably already written with function pointers for functions that are multiversioned.

So the only saving here is one simple init function that your program runs once, which runs CPUID a couple of times and sets all the function pointers. Doing it later, lazily as needed, means more cache misses, unless a lot of the function pointers go unused, e.g. `large-program --help`.
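As a baseline for comparison, a minimal sketch of such an init function with GCC/Clang builtins (the `popcnt_impl` / `popcnt_hw` / `popcnt_sw` names are illustrative):

```cpp
// Sketch: one-time CPUID-based dispatch through a function pointer
// (GCC/Clang; __builtin_cpu_supports queries CPUID internally).
#include <cstdint>

__attribute__((target("popcnt")))
static int popcnt_hw(uint64_t x) {
    return __builtin_popcountll(x);   // emits the POPCNT instruction
}

static int popcnt_sw(uint64_t x) {    // portable fallback
    int n = 0;
    for (; x; x &= x - 1) ++n;        // clear the lowest set bit each step
    return n;
}

static int (*popcnt_impl)(uint64_t) = popcnt_sw;

__attribute__((constructor))          // the "simple init function"
static void init_dispatch() {
    if (__builtin_cpu_supports("popcnt"))
        popcnt_impl = popcnt_hw;
}
```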

The code for these exception / signal handlers probably wouldn't be smaller than a simple init function. Interesting idea, but overall I don't see any meaningful benefit.


You also need to know which instruction faulted, if your program has multiple CPU features that it uses.

If you're emulating or something, you'd need to check whether it's one of your expected instructions that might raise #UD exceptions / SIGILL signals, e.g. by checking the machine code at the fault address.
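For example, a sketch of such a check for `popcnt` specifically (encoding `F3 0F B8 /r`, possibly with a REX prefix in between):

```cpp
#include <cstdint>

// Sketch: is the faulting instruction POPCNT? (F3 [REX] 0F B8 /r)
bool is_popcnt(const uint8_t* ip) {
    if (ip[0] != 0xF3) return false;
    int i = 1;
    if ((ip[i] & 0xF0) == 0x40) ++i;  // skip an optional REX prefix
    return ip[i] == 0x0F && ip[i + 1] == 0xB8;
}
```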

But if you instead had functions keep track of which optimistic dispatch they just did (so they could detect when it didn't work), you'd need to set a variable before every dispatch, so that's actually extra overhead.

Peter Cordes
  • I was going to restart the loop after it fails, without trying to identify what exactly failed. So it would not be as hard or as inefficient as emulation. Of course, the loop then has to be written so that whatever is done before a potential failure is not harmful to further execution. But if I wanted emulation, that's doable too: on x64 MSVC, in the `__except` expression all this information is available via `GetExceptionInformation()`, as well as from an `AddVectoredExceptionHandler` callback, and there's an option to return `EXCEPTION_CONTINUE_EXECUTION`. – Alex Guteniev Apr 26 '20 at 11:52
  • @AlexGuteniev: ok, so you would effectively need a try/catch, which has small but non-zero overhead in the no-exception case. At least it can restrict the compiler's ability to optimize, because of the possibility of a jump happening. And limiting it only to loops that can be aborted and retried another way does avoid a lot of problems. – Peter Cordes Apr 26 '20 at 11:57
  • @AlexGuteniev: The faster startup advantage here seems so tiny for the never-faulting case that I don't think this is actually worth doing even if the try/catch doesn't add any per-call overhead. On Linux you'd have to make a system call to install a signal handler for SIGILL, which would be more startup overhead than just doing CPUID in user-space. If Windows just automatically has SEH then maybe that's different. Having the approach be non-portable is also a problem. – Peter Cordes Apr 26 '20 at 12:01
  • The point I was thinking of was to have just two branches: one for the CPUs the program runs on in reality, and one for the CPUs spelled out in the minimum requirements. So instead of `CPUID`ing `SSE2/3/4`, `POPCNT`, whatever, just catch all invalid-instruction exceptions in one place. `x64` SEH (in contrast to `x86`) claims to be near zero overhead if no exception happens, as it uses function tables instead of SEH-specific instructions executed at run time. – Alex Guteniev Apr 26 '20 at 12:10
  • @AlexGuteniev: If you have specialized versions of some functions, some might require only SSSE3, some SSE4.1, some SSE4.2 and/or popcnt. And you might have an AVX+FMA version or AVX2 version as well. Old CPUs are the ones that need the most help to have acceptable performance for interactive use, so you don't want to give up on an SSSE3 version of something important just because AVX isn't available for something less often used. (e.g. CPUs without AVX are still widespread, Pentium/Celeron versions of current Intel uarches. Also low-power CPUs like Goldmont.) – Peter Cordes Apr 26 '20 at 12:21
  • Reading about Alder Lake: AVX-512 is disabled when the E-cores are enabled. A heterogeneous system with threads migrating across cores may be a good candidate for catching #UD, if some instructions are implemented, but only on the P-cores. – Alex Guteniev Nov 11 '21 at 07:01
  • @AlexGuteniev: No such heterogeneous systems actually exist, though, because if CPUID reports AVX-512 is available, stuff like `memcpy` will use it (with 256-bit vectors) in every process in existing binaries, even if only to avoid `vzeroupper` (and the resulting TSX abort if called in a transaction) by only using ymm16..31. That's why Alder Lake made that choice, as discussed in [What are performance and efficiency cores in Intel's 12th Generation Alder lake CPU Line?](https://superuser.com/a/1677779) and [@Bee's comments](https://stackoverflow.com/posts/comments/121850623) – Peter Cordes Nov 11 '21 at 07:08
  • If the x86 software ecosystem ever evolves in a direction of supporting heterogeneous ISA extensions, not just hetero microarchs like Alder Lake, likely *the OS* would be triggering migration in the #UD handler, not delivering SIGILL to user-space. Were you picturing that user-space would have to catch SIGILL and set its thread-affinity mask to only include cores that support the extensions it wants? You don't want this happening much, so there'd need to be some way for the OS (or CPUID?) to communicate which extensions are only available on some cores and should only be used if important. – Peter Cordes Nov 11 '21 at 07:08
  • No, I was picturing that the SW knows some ISA extension may suddenly become unavailable, and the SW may still use it, but only if it has a #UD handler with a fallback. Though maybe setting thread affinity is another option. – Alex Guteniev Nov 11 '21 at 07:13
  • @AlexGuteniev: Oh, so you're picturing that before calling a function with an AVX-512 loop, you'd call `sigsetjmp` or something so a SIGILL handler could get back out of a function containing AVX-512 instructions back to a well-defined state? After some unknown portion of the side effects on any memory the function modifies, so you couldn't use this with in-place updates that weren't idempotent, e.g. XORing two buffers together, unless you also were updating a progress indicator before and after every memory write. (In hand-written asm that could be a register, but in C a volatile size_t?) – Peter Cordes Nov 11 '21 at 07:18
  • Yes. Agreed, it has limited usage; still, `memcpy` or summing two arrays into a third would work: on #UD, just start the work over (see the sketch after this thread). If switches are rare, the long way of handling this may be worth the benefit of using the faster instructions. – Alex Guteniev Nov 11 '21 at 07:29
  • @AlexGuteniev: Before such a CPU would be practical, software would need a way to avoid AVX-512 in the first place for things where it doesn't provide a *big* benefit, or to find out when it's currently running on an E-core in a way that's more efficient than faulting and handling a user-space signal. And to know when it's worth trying AVX-512 instructions again without running a slow CPUID instruction (which is extra slow in a VM because it's *always* a vm-exit). Perhaps reading a byte from a VDSO data page exported by the kernel, which has per-core data instead of per-thread or per-proc? – Peter Cordes Nov 11 '21 at 07:42
  • Things that are usually bottlenecked on L3 or DRAM bandwidth even with AVX2 often don't benefit much. And the time it takes to get into the kernel, do Spectre mitigation stuff before running any kernel code, save/restore integer regs, deliver a user-space signal, maybe make a sigreturn system call, before eventually restarting the loop with the AVX2 version, is pretty costly. So that needs to be avoided when we're already on an E-core. Maybe some infrastructure to allow some user-writable pages to have different data on E-cores vs. P-cores, so you can keep function ptrs there for dispatch. – Peter Cordes Nov 11 '21 at 07:46
  • If we go beyond AVX-512 and further into heterogeneous CPUs, I think a heterogeneous ISA should somehow be workable. If handling the failure isn't a good idea, then maybe just some kind of enter/exit instruction, similar to `vzeroupper` for exiting AVX2, that would mark regions where advanced instructions are used. – Alex Guteniev Nov 11 '21 at 07:56
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239109/discussion-between-peter-cordes-and-alex-guteniev). – Peter Cordes Nov 11 '21 at 07:57
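The restart-on-#UD pattern from the comments above, sketched with `sigsetjmp` (the `fast_sum` AVX-512 version is hypothetical, and a SIGILL handler like the earlier one must already be installed):

```cpp
// Sketch: optimistic dispatch with restart-on-SIGILL, as discussed above.
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

extern void fast_sum(const int*, const int*, int*, size_t); // hypothetical AVX-512 loop

static sigjmp_buf retry_point;
static volatile sig_atomic_t use_fast = 1;

static void on_sigill(int) {
    use_fast = 0;                    // remember that the fast path faulted
    siglongjmp(retry_point, 1);      // unwind to the checkpoint, restoring
}                                    // the signal mask saved by sigsetjmp

void sum_arrays(const int* a, const int* b, int* out, size_t n) {
    if (use_fast && sigsetjmp(retry_point, 1) == 0) {
        fast_sum(a, b, out, n);      // may raise #UD partway through
        return;
    }
    // Portable fallback: safe to redo from scratch because each out[i]
    // depends only on a[i] and b[i], so a partial first pass is harmless.
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```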