I'm reading "The Art of Assembly: The MMX Instruction Set". It says that after executing MMX instructions, the EMMS instruction must be executed to reset the FPU, and that EMMS is quite slow.

However, when I profiled EMMS to see just how slow it is (using RDTSC to count clock cycles), it appeared to execute in 0 cycles.

What's going on? Have I made a mistake somewhere or is Art Of Assembly out of date?

Evan Carroll

2 Answers

It was slow on the ancient Pentium MMX, but on more modern processors it is very fast.

Still, MMX is mostly obsolete today. Use SSE2, and you'll have no problems multiplexing with the FPU.

Also, the RDTSC instruction is not serializing, so the CPU can execute it out of order and in parallel with other instructions. That explains your measurement: the CPU simply started executing both RDTSCs and the EMMS in the same clock cycle. To measure the time a piece of code takes, you must serialize both RDTSCs with respect to that code; the CPUID instruction is usually used for this. Since the serializing instructions take CPU cycles themselves, you also have to measure how many cycles the measurement rig takes with no code in it, and subtract that overhead.
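As a rough illustration of such a measurement rig, here is a minimal sketch in C. It assumes GCC or Clang on an x86 target (for the inline-assembly syntax and `<mmintrin.h>`); the helper names `serialized_rdtsc` and `measure_min_ticks` are made up for this example. Keep in mind that on most modern CPUs RDTSC counts at a fixed reference frequency rather than in core clock cycles, so the result is only an estimate.

```c
#include <assert.h>
#include <stdint.h>
#include <mmintrin.h>   /* _mm_empty(), which compiles to EMMS */

/* Hypothetical helper: execute CPUID (a serializing instruction),
 * then read the time-stamp counter. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx, lo, hi;

    /* CPUID leaf 0; the results are ignored, only the serialization matters. */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    /* RDTSC returns the 64-bit counter in EDX:EAX. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    (void)ebx;
    (void)edx;
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: smallest observed tick count over `iterations` runs.
 * with_emms == 0 times the empty rig (the overhead baseline),
 * with_emms != 0 times the rig with one EMMS inside it. */
static uint64_t measure_min_ticks(int with_emms, int iterations)
{
    uint64_t best = UINT64_MAX;

    assert(iterations > 0);
    for (int i = 0; i < iterations; i++) {
        uint64_t t0 = serialized_rdtsc();
        if (with_emms)
            _mm_empty();
        uint64_t t1 = serialized_rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling `measure_min_ticks(0, 1000)` gives the overhead of the rig itself, and subtracting it from `measure_min_ticks(1, 1000)` gives an estimate for EMMS. Taking the minimum over many runs is one way to filter out interrupts and cache misses; it is a design choice of this sketch, not something the answer prescribes.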

The last point is that even on the Pentium MMX, the EMMS instruction itself finished quickly. It was the first FPU instruction after it that incurred a nasty delay.

stormsoul
  • `emms` is hardly "very fast" on Intel CPUs from ~2009. On Core2 and Nehalem, it's 11 uops, and has a throughput of one per 6 cycles (http://agner.org/optimize/). It's only worth using MMX if you have a loop, not for a few instructions of 64-bit integer math in 32-bit mode, or a 64-bit copy, if you can't inline it into a larger function. On later CPUs (where MMX is more and more obsolete), EMMS is even slower, e.g. 31 uops / 18 cycles on Sandybridge. – Peter Cordes May 15 '18 at 01:20
  • And if you have SSE2, you don't need x87 at all (except if you actually need 80-bit precision, or if a 32-bit calling convention forces it). – Peter Cordes May 15 '18 at 01:21
  • EMMS on P5MMX takes only one clock; the actual penalty for the first x87 instruction is ~58 clocks, according to Agner Fog's instruction tables. So it's actually cheap on that CPU to just have EMMS at the end of a bunch of functions, if no x87 instructions run. – Peter Cordes May 15 '18 at 01:23

You need a serializing instruction, such as CPUID, to ensure that RDTSC is not executed out of order. You can read more here.

zvrba