
While examining the instruction set for Intel x86 processors I noticed there are 'intuitive' instructions like 'mov', 'add', 'mul' ..., while others seem a bit unnatural, like 'sete'. The question is more out of curiosity than practical concern: why would designers choose to implement particular execution scenarios as single instructions? Do you know any reading material that explains such design decisions?

Peter Cordes
Eugen
  • Probably to enhance support for high-level language compilers. Also, it's in the nature of instruction sets that once you decide to implement one conditional-upon-flag instruction, you get a lot, because the flags are all in one register; so once you have your SETZ/SETE, you may as well document the extra instructions that the silicon gave you for free. – Martin James Apr 30 '12 at 09:05
  • As a side-note there's a very good reason for SETcc, it's the father of CMOVcc, which was added later. If you have two small code paths separated by an unpredictable branch then it's often quicker to execute both rather than handle a branch mispredict 50% of the time. With CMOVcc you can do mov eax,result1; cmovne ecx,result2, you can emulate this with SETcc by anding the results with masks and oring them. – jleahy Sep 13 '12 at 10:32
  • @jleahy: Could you please give an example snippet of the SETcc emulation of CMOVcc? – zx485 Jan 13 '16 at 08:36
  • @zx485: Something like this: `xor edx,edx; setz dl; dec edx; mov eax,result1; and eax,edx; not edx; and result2,edx; or eax,result2`. The trick is setz gives you 0 or 1 (in DL, with EDX zeroed first, since SETcc only writes a byte register), then decrementing it gives you 0x00000000 or 0xFFFFFFFF. In the end you have `output=(result1&mask)|(result2&(~mask))`. – jleahy Jan 13 '16 at 13:47
  • @jleahy: Thanks. Very interesting. Maybe of use in some cases where CMOVcc is not applicable. – zx485 Jan 14 '16 at 09:47
  • @zx485 Yes, or in older processors where CMOV isn't available. It's also very good for things like `bool x = (y > 0);` in C++. – jleahy Jan 14 '16 at 14:13

3 Answers


Some criteria that designers use to decide if a "particular execution scenario" is a reasonable candidate for an instruction:

  1. stateless behavior - The operation must depend only on operands or otherwise visible machine state (e.g. arithmetic flags) at the time of execution. No hidden state allowed. This restriction rules out non-blocking instructions that stay busy after the instruction appears to have completed.

  2. limited memory touching - Memory access is often a rate limiter. Other than improving code density, it makes little sense to combine discrete operations into one big instruction if both versions perform the same because memory is the bottleneck.

  3. computationally interesting - The new instruction should do something more efficiently than otherwise possible. The x86 AES instructions are extreme examples. Relatively simple operations like bit swizzles matter too if they happen often enough.

  4. business value - Does the silicon area and validation effort to implement the instruction pay for itself?

  5. compatibility value - Last, but not least, many instructions exist for no other reason than to support legacy software.

srking
  • Thank you for a detailed answer. I wonder if there is a paper that describes instruction_utility/silicon_area coefficient :) – Eugen May 10 '12 at 10:15

In the case of sete, it was probably a matter of practical experience with code written in the instruction set. At least if memory serves, sete was added as of the 386, so by then the instruction set had been in active use for a few years. At a guess, they probably spent some time looking through code to find things that were done a lot, but not directly supported in the instruction set. They would probably screen those to find ones that would be easy to make a lot more efficient by supporting them directly in the CPU.

A lot of cases are rather similar to that -- work is basically prototyped in software to find a design that's reasonably flexible, efficient and simple to implement. Then, when the design is relatively polished, the CPU designers look it over and see whether they can't make it at least a little more efficient by implementing (at least parts of) it in hardware.

Most of the so-called RISC processors were designed by collecting statistics on the code generated from source code with existing compilers on existing processors. Then they looked through the frequency of instruction use, and (attempted to) optimize those that were used a lot, and simply dropped those that weren't used very much.

Jerry Coffin

There are at least two possible instruction sequences that achieve the same result. Here's my analysis of them:

; "classic" code
 xor eax,eax     ; zero AL in case we don't write it.
 cmp edx,15
 jne past
  mov al,1        ; AL = (edx==15)
past:

And

; "evolved" code
; xor eax,eax optional to get a bool zero-extended to full register
cmp edx,15
sete al        ; AL = 0 or 1 according to edx==15

It is a bit of a mindset that a conditional, simple assignment "should" involve a conditional jump on the opposite condition, even if it only jumps around a single instruction. But it is - in your own words - a particular execution scenario that occurs frequently so if a better alternative were available, why not?

When code executes, there are many factors affecting execution speed. Two of them are the time it takes for the result of a comparison/arithmetic/boolean operation to reach the flags register, and the execution penalty incurred when a jump is taken (I'm over-simplifying this a bit).

So the classic code will either execute a move or take a jump. The former will probably be executed in parallel with other code and the latter may cause the prefetcher to load data from the new position resulting in wait states. The processor's branch prediction may be involved and may - depending on a lot of factors - predict incorrectly which incurs additional penalties.

In the evolved case, code prefetching is not affected at all, which is good for execution speed. Also, the sete sequence will probably fit in fewer bytes than the mov+jne combo, which means relatively less code cache line capacity/work is involved in the execution, freeing up relatively more cache capacity for data as well. If the result of the assignment isn't needed right away, the sete could be rescheduled to a position where it blends in better (execution-wise) with the surrounding code. This rescheduling could be performed explicitly (by the compiler) or implicitly (by the CPU itself). Static scheduling (by the compiler) is limited because most x86 integer instructions affect FLAGS, so the cmp and the sete can't be separated far.

For normal (usually un-tuned), bloated application code, the use of instructions such as this will have little impact on overall performance. In highly specialized, hand-tuned code with very tight loops, the difference between executing within three rather than four or five cache lines can make an enormous difference, especially if multiple copies of the code are running on different cores.

Olof Forshell