MWAIT vs HALT in terms of efficiency

Question

I have a question about the MONITOR/MWAIT and HLT instructions. Both halt the processor, and both wake up on various external triggers (interrupts, etc.).

In my experiments, HLT and MWAIT behave almost the same, taking the following into account:

If you are not the OS scheduler, a simple loop around either instruction will be interrupted quite quickly. And since MONITOR/MWAIT requires re-checking the condition between MONITOR and MWAIT anyway, what is the difference? In other words, why not use HLT in the first place and save the pain of allocating a monitored area (which, if not carefully configured, defeats the MONITOR/MWAIT mechanism and turns it into a NOP)? If you are not the OS scheduler, you will always wake up soon enough to simply check the value in a HLT loop.

(Surely MWAIT could have finer resolution. I haven't measured it, but it seems to wake up too often, I assume because of interrupts and the like, so I can't see the big advantage.)

Thanks. Any thoughts on this would be greatly appreciated.

score 15 · Answer 1 · answered Nov 20 '12 at 02:37

For performance, what matters most is the time it takes for the CPU to come out of its "waiting" state when whatever it is waiting for (an IRQ for HLT, or either an IRQ or a memory write for MWAIT) occurs. This affects latency, e.g. how long it will take before an interrupt handler is started or before a task switch actually occurs. The time taken for a CPU to come out of its waiting state is different for different CPUs, and may also be slightly different for HLT and MWAIT on the same CPU.

The same applies to power consumption - power consumed while waiting can vary a lot between different CPUs (especially when you start thinking about things like hyper-threading); and power consumption of HLT vs. MWAIT may also be slightly different on the same CPU.

For usage, they're intended for different situations. HLT is for waiting for an IRQ, while MWAIT is for waiting for a memory write to occur. Of course if you're waiting for a memory write to occur then you need to decide whether IRQs should interrupt your waiting or not (e.g. you can do CLI then MWAIT if you only want to wait for a memory write).

However, for multi-tasking systems, mostly they're both only used for the same thing - in schedulers where the CPU is idle. Before MONITOR/MWAIT was introduced, schedulers would use HLT while waiting for work to do (to reduce power consumption a little). This means that if another CPU unblocks a task it can't just put that task into the scheduler's queue and has to send a (relatively expensive) "inter-processor interrupt" to the HLTed CPU to knock it out of its HLT state (otherwise the CPU will keep doing nothing when there's work it can/should do). With MWAIT, this "inter-processor interrupt" is (potentially) unnecessary - you can set MONITOR to watch for writes to the scheduler's queue, so that the act of putting the task onto the queue is enough to cause a waiting CPU to stop waiting.
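
The wake-up path described above, including the re-check between MONITOR and MWAIT that the question asks about, can be sketched roughly as follows. This is only a user-space model in C, not kernel code: `wait_ops`, `idle_until_work`, `sim_arm`, `sim_wait` and `sim_queue` are hypothetical names, and the two hooks merely stand in for the privileged instructions, which a real scheduler would issue through inline assembly or compiler intrinsics.

```c
#include <stdatomic.h>

/* Hypothetical hooks standing in for the privileged instructions. */
typedef struct {
    void (*arm)(const volatile void *addr); /* models MONITOR */
    void (*wait)(unsigned hint);            /* models MWAIT   */
} wait_ops;

/* Idle until *queue_len becomes non-zero. Returns how many times the
 * CPU actually went to sleep, which makes the model easy to check. */
unsigned idle_until_work(atomic_int *queue_len, const wait_ops *ops,
                         unsigned hint)
{
    unsigned sleeps = 0;

    while (atomic_load_explicit(queue_len, memory_order_acquire) == 0) {
        ops->arm(queue_len);
        /* Re-check after arming: a write that landed just before the
         * monitor was armed would otherwise go unnoticed. */
        if (atomic_load_explicit(queue_len, memory_order_acquire) != 0)
            break;
        ops->wait(hint); /* may also return on an interrupt, hence the loop */
        sleeps++;
    }
    return sleeps;
}

/* Simulation only: pretend another CPU enqueues a task while we sleep. */
atomic_int *sim_queue;
void sim_arm(const volatile void *addr) { (void)addr; }
void sim_wait(unsigned hint) { (void)hint; atomic_store(sim_queue, 1); }
```

In this model the "other CPU" only has to store to the queue word; no inter-processor interrupt is involved. The outer loop is what makes spurious wake-ups (interrupts and the like) harmless.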

There has also been some research into using MONITOR/MWAIT for things like spinlocks and synchronisation (e.g. waiting for a contended lock to be released). The end result of this research is that the time it takes for the CPU to come out of its "waiting" state is too high and using MONITOR/MWAIT like this causes too much performance loss (unless there are design flaws - e.g. using a spinlock when you should be using a mutex).

I can't think of any other reason (beyond schedulers and locking/synchronisation) to use HLT or MWAIT.

Brendan, thanks a lot for this very informative overview. You surely cleared the fog I had around this. I was thinking of using MONITOR/MWAIT to synchronize between a thread and HW (to avoid a SetEvent from the DPC after each interrupt coming from the HW). It seems that the latency around WaitForSingleObj and SetEvent, compared to a spinlock in the waiting thread (and a global var), is much higher, so MWAIT seems like a point somewhere in the middle. – win32 devPart, Nov 21 '12 at 14:50

score 11 · Answer 2 · answered Aug 08 '13 at 16:20

The HLT instruction implements the shallowest idle power state (C-State) available for an individual thread, whereas the MWAIT instruction allows you to request all available idle power states as well as sub-states.

At the hardware level, executing HLT is equivalent to executing MWAIT with a state hint of 0. This puts the processor in the C1 state, which is clock gating for the core. If you want to enter deeper C-States in order to power gate the core and potentially power gate the package, you must use MWAIT.
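
As a rough illustration of the "state hint", here is a minimal C sketch of how the EAX hint for MWAIT is laid out as I understand the architectural encoding: bits 7:4 carry the target C-state minus one (so 0 means C1) and bits 3:0 carry the sub-state. The helper name `mwait_hint` is made up for this example, and which hints a given CPU accepts, and what hardware state each one maps to, is model-specific (enumerated through CPUID leaf 5).

```c
#include <stdint.h>

/* Build the EAX hint for MWAIT.
 * bits 7:4 = target C-state minus 1 (0 -> C1, 1 -> C2, ...)
 * bits 3:0 = sub-state within that C-state
 * Passing cstate == 0 wraps to 0xF in bits 7:4, the encoding for C0. */
uint32_t mwait_hint(unsigned cstate, unsigned substate)
{
    return (((cstate - 1u) & 0xFu) << 4) | (substate & 0xFu);
}
```

For example, `mwait_hint(1, 0)` yields `0x00`, the hint the answer equates with HLT, while `mwait_hint(3, 0)` yields `0x20`.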

There's always a tradeoff between power savings and exit latency for various power states. The deeper the C-State, the more power savings, but the longer it takes to exit the C-State. You should also note that modern x86 processors will limit the depth of the power state based on the frequency of interrupts (i.e. if you're receiving break events every 1 us, hardware will not attempt to enter a C-State with a 2 us exit latency).
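
The tradeoff can be modelled with a small C sketch: given a table of states ordered from shallow to deep, pick the deepest one whose exit latency still fits in the expected idle interval. This is only an illustration of the idea, not how any particular processor or OS governor is implemented; `cstate_info`, `pick_cstate` and the latency numbers in `example_states` are all invented for the example.

```c
#include <stddef.h>

typedef struct {
    const char *name;
    unsigned exit_latency_ns;
} cstate_info;

/* table must be ordered shallow -> deep with increasing exit latency.
 * Returns the index of the deepest state whose exit latency is shorter
 * than the expected idle interval; falls back to index 0. */
size_t pick_cstate(const cstate_info *table, size_t n,
                   unsigned expected_idle_ns)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++) {
        if (table[i].exit_latency_ns < expected_idle_ns)
            best = i;
        else
            break; /* deeper states only get slower to exit */
    }
    return best;
}

/* Hypothetical latencies, chosen only to mirror the 1 us / 2 us example. */
const cstate_info example_states[] = {
    { "C1", 500 },
    { "C3", 2000 },
    { "C6", 50000 },
};
```

With break events arriving every 1 us (`expected_idle_ns = 1000`), the state with a 2 us exit latency is skipped and the shallowest state is chosen.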

In addition to the hardware limiting which C-State is actually entered, some C-States may only be entered through coordination between threads. For instance, on an Intel x86 processor with Hyper-threading, both threads in a core must request a power-gated C-State for power-gating to occur at the core level, and likewise all cores in a package must request a package-level power-gated C-State for power-gating to occur at the package level. The hardware generally abides by the shallowest request, so if one thread requests C1 and another requests C3, the processor enters C1.
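
The "shallowest request wins" rule amounts to taking a minimum over the per-thread requests. A minimal C sketch, assuming C-states are represented as plain numbers (1 for C1, 3 for C3, and so on) and using the made-up name `resolve_core_cstate`:

```c
#include <stddef.h>

/* Resolve per-thread C-state requests into the state the core may
 * enter: the shallowest (numerically smallest) request wins.
 * Returns 0 (treated here as "stay active") when there are no requests. */
unsigned resolve_core_cstate(const unsigned *requests, size_t n)
{
    if (n == 0)
        return 0;

    unsigned shallowest = requests[0];
    for (size_t i = 1; i < n; i++) {
        if (requests[i] < shallowest)
            shallowest = requests[i];
    }
    return shallowest;
}
```

The same function applies one level up: feeding it the resolved state of every core gives the deepest state the package as a whole could enter.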

If you aren't controlling the operating system, then it's really a moot point (since MWAIT is only available at CPL0). If you "own" the operating system, then it will almost always make sense to use MWAIT instead of HLT, since it results in much higher power savings in many cases and provides access to the same idle power state that HLT does.

score 4 · Answer 3 · answered Jun 26 '13 at 03:01

MONITOR/MWAIT should be usable "for things like spinlocks and synchronisation (e.g. waiting for a contended lock to be released)."

However, MONITOR/MWAIT (a) for an amazingly stupid and annoying reason had to be restricted to only be used by ring 0 kernel code, not user code, and (b) became loaded down with microcode to go into low power sleep states.

Some companies have implemented similar or equivalent instructions better, e.g. MIPS' LL/PAUSE is roughly equivalent to MONITOR/MWAIT.

Krazy Glew

KNL has a ring 3 implementation: https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait – Jeff Hammond Oct 10 '16 at 17:10
Thanks for the news about user mode MWAIT in KNL. I look forward to seeing performance data. My guess is that unless they have implemented microcode branch prediction, it will be quite slow. – Krazy Glew Oct 11 '16 at 05:00
What colleagues have told me is that a carefully tuned spin loop with pause instructions is faster than monitor-mwait, but that if one is willing to trade a bit of latency, then monitor-mwait should be viable. Even if the direct benefit isn't there, there is a huge indirect benefit from parking hardware threads in a power-constrained environment, which many supercomputers are. If you have a benchmark, feel free to contact me privately and I will try to get you some data. – Jeff Hammond Oct 12 '16 at 13:01
