
I need CAS functions to use in a context of multiple threads running on the same CPU (assume that all threads are statically glued to selected CPU, via SetThreadAffinityMask).

InterlockedCompareExchange generates LOCK CMPXCHG. The LOCK part comes with side effects such as a cache miss, a bus lock and potential contention with other CPUs, all of which are nice, but feel like an extravagant excess given the circumstances. Since the competing threads run on the same CPU, I assume the LOCK can be dropped, and I further assume that dropping it should improve performance.

So this is my first question - do I assume correctly?

--

I know how to generate CMPXCHG with inline assembly for the 32-bit version. Also, as per this SO thread, I know how to do it for the 64-bit version too, but only as a function call.

What I don't understand, and this is my second question, is how to generate an inlined version of it.

--

Thanks.

Angstrom
  • Hmm.. if I was trying this, I would use a macro, (OK, ~~shudder~~), or ifdef so I could easily add the lock prefix later, if I suspected problems or found that SetThreadAffinityMask is an undesirable 'optimization'. – Martin James Jan 09 '13 at 21:27
  • So if you're effectively guaranteeing every thread is run uniquely (so the only reads and writes to shared data are from a single thread), you're just writing a single threading program. So why use `InterlockedCompareExchange`? – GManNickG Jan 09 '13 at 21:29
  • @GManNickG He is using multiple threads on a single logical CPU. Effectively he is battling pre-emptive multitasking here -- he needs it to be a single instruction so that his thread isn't paused in the middle of `if(x == y) x = z`, but he doesn't need `LOCK` because it's still only within a single logical CPU. – Cory Nelson Jan 09 '13 at 21:32
  • @CoryNelson: Ah, gotcha. So he wants the atomic update without the memory fence. Just have to use the implementation guarantees of the platform; that is, pry open MSVC2012's `` and copy `std::atomic::compare_exchange_strong(..., memory_order_relaxed)`. – GManNickG Jan 09 '13 at 21:36
  • @CoryNelson - Don't have MSVC2012, so can't do... though I suspect it's another intrinsic, which is not in the 2010 version. – Angstrom Jan 10 '13 at 08:12
  • Doesn't matter -- it still uses `InterlockedCompareExchange` for `memory_order_relaxed`. I don't think there's an intrinsic to do this. – Cory Nelson Jan 10 '13 at 15:24
  • @CoryNelson - thanks for the follow-up. As per usual, I simply reworked the code to not need lock-less cmpxchg to begin with :) – Angstrom Jan 13 '13 at 17:54

2 Answers


Not to answer my own question, but to describe a workaround, of sorts.

For CAS on boolean variables, it's possible to fall back to _bittestandset, which is slower than CMPXCHG, but has an intrinsic form in VS2010.
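A minimal sketch of that fallback, for illustration only. The helper names `try_acquire_flag` and `release_flag` are hypothetical. It assumes the MSVC intrinsics `_bittestandset` and `_bittestandreset` from `<intrin.h>` compile to a single BTS/BTR on the memory operand, which is worth confirming in the disassembly. The non-MSVC branch is only a stand-in so the sample builds elsewhere: it uses a fully atomic builtin, which is stronger than what is needed here.

```c
#include <assert.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Hypothetical helper: try to take a one-bit flag.
   Returns 1 if the bit was clear and is now set, 0 if it was already set. */
static int try_acquire_flag(long *flags, long bit)
{
    assert(bit >= 0 && bit < 32);
#if defined(_MSC_VER)
    /* _bittestandset returns the previous value of the bit. */
    return _bittestandset(flags, bit) == 0;
#else
    /* Stand-in for other compilers: atomic fetch-or (GCC/Clang builtin). */
    long mask = (long)(1UL << bit);
    return (__atomic_fetch_or(flags, mask, __ATOMIC_RELAXED) & mask) == 0;
#endif
}

/* Hypothetical helper: clear the flag again. */
static void release_flag(long *flags, long bit)
{
    assert(bit >= 0 && bit < 32);
#if defined(_MSC_VER)
    (void)_bittestandreset(flags, bit);
#else
    (void)__atomic_fetch_and(flags, ~(long)(1UL << bit), __ATOMIC_RELAXED);
#endif
}
```

A thread would call `try_acquire_flag(&flags, n)` and proceed only when it returns 1, then call `release_flag(&flags, n)` when done.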

Angstrom

This is really more of a comment, but the space is a little too limited...

I doubt* you'll get the CMPXCHG instruction on its own without the use of assembly. If the region is that critical, use the Interlocked intrinsics, disassemble the output, remove the LOCK prefix and link that in. I'd do this for both the 32- and 64-bit variants: inline ASM is less than optimal in MSVC, as it's always treated as unsafe, which causes extra protection cruft to be inserted and may end up worse than calling an external version. On the plus side, it'll also give you a more uniform code layout.

I'd also recommend you profile both solutions, with and without the LOCK, as most newer Intel CPUs implement cache-level locks that greatly reduce the performance impact of the lock (Chapter 8 of the Intel Software Developer's Manual provides a healthy bit of insight into the exact effects of bus locking).
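A rough timing harness along those lines might look like the sketch below. It is not MSVC code: it assumes GCC or Clang extended inline assembly on x86/x86-64, and the names `CAS_ASM` and `time_cas_loop` are made up for the example. On other targets both variants fall back to the same atomic builtin, so the timings would not differ there. It runs on one thread, so it measures only the uncontended cost of the prefix.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
/* prev must hold the comparand on entry; it holds the old value on exit. */
#define CAS_ASM(prefix, dest, prev, xchg)                      \
    __asm__ __volatile__(prefix "cmpxchgl %2, %1"              \
                         : "+a"(prev), "+m"(*(dest))           \
                         : "r"(xchg)                           \
                         : "cc", "memory")
#else
/* Non-x86 fallback: both variants use the same builtin. */
#define CAS_ASM(prefix, dest, prev, xchg)                      \
    __atomic_compare_exchange_n((dest), &(prev), (xchg), 0,    \
                                __ATOMIC_RELAXED, __ATOMIC_RELAXED)
#endif

/* Increment *counter n times through a CAS, with or without LOCK.
   Returns the elapsed CPU time in seconds. */
static double time_cas_loop(volatile int32_t *counter, int32_t n, int locked)
{
    clock_t start = clock();
    for (int32_t i = 0; i < n; ++i) {
        int32_t expected = *counter;
        int32_t prev = expected;
        int32_t next = expected + 1;
        if (locked) {
            CAS_ASM("lock; ", counter, prev, next);
        } else {
            CAS_ASM("", counter, prev, next);
        }
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_cas_loop(&counter, n, 1)` and then `time_cas_loop(&counter, n, 0)` with a large `n` gives two numbers to compare on the target machine.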

*By "doubt" I mean: it doesn't exist as an explicit intrinsic, and using compiler coercion tricks is very brittle, not that I know of any for coercing the emission of XCHG or CMPXCHG (with the exception of XCHG (E)AX,(E)AX, used as an alignment NO-OP).
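For comparison, compilers that accept extended inline assembly (GCC or Clang, not MSVC on x64) can inline an unlocked CMPXCHG directly. A minimal sketch under that assumption; `cas_nolock` is a hypothetical name, and on non-x86 targets it falls back to a regular atomic builtin just so the sample still builds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compare-and-swap without the LOCK prefix.
   If *dest == comparand, stores exchange into *dest.
   Returns the value *dest held before the operation. */
static inline int32_t cas_nolock(volatile int32_t *dest,
                                 int32_t exchange, int32_t comparand)
{
    int32_t prev = comparand;
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    /* CMPXCHG compares EAX with the memory operand: on a match it
       stores the source register, otherwise it loads memory into EAX. */
    __asm__ __volatile__("cmpxchgl %2, %1"
                         : "+a"(prev), "+m"(*dest)
                         : "r"(exchange)
                         : "cc", "memory");
#else
    /* Fallback for other targets: a fully atomic CAS (GCC/Clang builtin). */
    __atomic_compare_exchange_n(dest, &prev, exchange, 0,
                                __ATOMIC_RELAXED, __ATOMIC_RELAXED);
#endif
    return prev;
}
```

Because the instruction is a single read-modify-write, a thread cannot be preempted halfway through it, which is all that is needed when every competing thread is pinned to the same logical CPU.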

Necrolis