2

I was wondering if it would be possible to use SSE in parallel with x87. Consider the following pseudocode:

    1: sse_insn
    2: x87_insn

Would the pipeline execute 1 and 2 in parallel, assuming they are independent of each other?

1 Answer

8

In all modern (and older) processors, x87 and SSE instructions use the same execution units, so it's UNLIKELY that you will benefit much from this sort of code. There may be very special cases where you can trick the processor into running, for example, an x87 divide in parallel with an SSE add, but if you are simply doing a big loop of similar operations, there is almost certainly no benefit.
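To make the shared-unit argument concrete, here is a toy model in Python. It is only a sketch, not a measurement or a description of any real CPU: the `UNIT` table and the `issue_cycles` helper are hypothetical names made up for illustration, and the model assumes every instruction is independent and that each unit accepts exactly one instruction per cycle. For real port assignments, consult published instruction tables.

```python
# Hypothetical mapping from mnemonic to execution unit (an assumption for
# illustration, not real port data). x87 and SSE versions of an operation
# share one unit; integer work has its own.
UNIT = {
    "addps": "fp_add",   # SSE add
    "fadd":  "fp_add",   # x87 add
    "mulps": "fp_mul",   # SSE multiply
    "fmul":  "fp_mul",   # x87 multiply
    "add":   "int_alu",  # integer add
}


def issue_cycles(instructions):
    """Count cycles needed to issue independent instructions when each
    unit accepts at most one instruction per cycle."""
    pending = list(instructions)
    cycles = 0
    while pending:
        busy = set()      # units already used in this cycle
        waiting = []      # instructions that must wait for a later cycle
        for insn in pending:
            unit = UNIT[insn]
            if unit in busy:
                waiting.append(insn)
            else:
                busy.add(unit)
        pending = waiting
        cycles += 1
    return cycles


if __name__ == "__main__":
    print(issue_cycles(["addps", "fadd"]))   # same unit: 2 cycles
    print(issue_cycles(["addps", "addps"]))  # same unit: 2 cycles
    print(issue_cycles(["addps", "fmul"]))   # different units: 1 cycle
    print(issue_cycles(["addps", "add"]))    # FP + integer: 1 cycle
```

In this model, pairing an SSE add with an x87 add costs exactly as much as two SSE adds, which is why mixing the two instruction sets buys nothing by itself. Overlap only appears when the instructions need different units.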

Mats Petersson
  • So it would be possible (my target processor is the new-gen i7, not a P3, so I'm safe in terms of mutating the x87 registers by using SSE)? – user2366538 May 09 '13 at 14:09
  • Very unlikely that you would benefit from it, like my answer says (unlikely in capitals for emphasis). In other words, it's the SAME pipeline once the instructions have been sufficiently decoded, whether it is x87 or SSE instructions - yes, there are separate registers, but it's unlikely that any FPU instructions are limited by register dependencies such that you benefit there. – Mats Petersson May 09 '13 at 14:11
  • What if I were to quantize some of the data ahead of time? Could I then, in theory, use the ALU in parallel with x87/SSE? – user2366538 May 09 '13 at 14:14
  • Yes, most SSE/x87 instructions will run in parallel with most non-FPU instructions. The processor has quite a bit of "lookahead" on this - I think AMD processors can look ahead at least 32 (RISC) instructions, so again, rearranging the code may not give a huge benefit. – Mats Petersson May 09 '13 at 14:18
  • I thought the key was not the pipeline but the super-scalar architecture. The i7 has 6 ports and can do one AVX addition and one AVX multiply at the same time. So doing x87 and SSE at the same time should be no problem? –  May 10 '13 at 08:03
  • Sorry, are you asking if you can run an arbitrary x87 and an arbitrary SSE instruction in parallel, or if you gain something from mixing SSE and x87 instructions above and beyond just using SSE instructions? If the former, yes, the processor will do that. If the latter, then no. All floating point instructions, whether AVX, SSE or x87 will run through the same execution units - there is one unit for add/subtract and another for multiply/divide (and, I believe, a dedicated load/store for float operations too). On every clock cycle, one instruction can go through each unit. – Mats Petersson May 10 '13 at 08:28
  • The only reason you'd want to mix SSE and x87 is to get more registers. (or for x87 `tan/exp/log` insns, but a good math library for SSE will do just as well.) Even then, it's usually not worth it, because loads from locals in L1 cache are extremely cheap. See http://agner.org/optimize/ for instruction tables, including which execution port each instruction goes to. (Also very excellent guides to optimizing in asm.) – Peter Cordes Aug 02 '15 at 05:29
  • @MatsPetersson: You should take out the speculation about when it might be worth it. An FP add being able to dispatch to an execution port that cycle doesn't depend on whether it's a vector FP divide or an x87 FP divide occupying the other port. – Peter Cordes Aug 02 '15 at 05:37
  • Also, I know `divps` isn't fully pipelined, but I'm not sure if `mulps` can dispatch to port0 before it would be ready for another `divps` (on recent Intel). (i.e. whether they're separate execution units attached to the same execution port, and whether that matters.) – Peter Cordes Aug 02 '15 at 05:38
  • You do realize this question is 2 years old? – Mats Petersson Aug 02 '15 at 08:43