I'm looking at the spin lock implementation in the HotSpot JVM from OpenJDK 12. Here is how it is implemented (comments preserved):
// Polite TATAS spinlock with exponential backoff - bounded spin.
// Ideally we'd use processor cycles, time or vtime to control
// the loop, but we currently use iterations.
// All the constants within were derived empirically but work over
// over the spectrum of J2SE reference platforms.
// On Niagara-class systems the back-off is unnecessary but
// is relatively harmless. (At worst it'll slightly retard
// acquisition times). The back-off is critical for older SMP systems
// where constant fetching of the LockWord would otherwise impair
// scalability.
//
// Clamp spinning at approximately 1/2 of a context-switch round-trip.
// See synchronizer.cpp for details and rationale.
int Monitor::TrySpin(Thread * const Self) {
  if (TryLock()) return 1;
  if (!os::is_MP()) return 0;
  int Probes = 0;
  int Delay = 0;
  int SpinMax = 20;
  for (;;) {
    intptr_t v = _LockWord.FullWord;
    if ((v & _LBIT) == 0) {
      if (Atomic::cmpxchg (v|_LBIT, &_LockWord.FullWord, v) == v) {
        return 1;
      }
      continue;
    }
    SpinPause();
    // Periodically increase Delay -- variable Delay form
    // conceptually: delay *= 1 + 1/Exponent
    ++Probes;
    if (Probes > SpinMax) return 0;
    if ((Probes & 0x7) == 0) {
      Delay = ((Delay << 1)|1) & 0x7FF;
      // CONSIDER: Delay += 1 + (Delay/4); Delay &= 0x7FF ;
    }
    // Stall for "Delay" time units - iterations in the current implementation.
    // Avoid generating coherency traffic while stalled.
    // Possible ways to delay:
    //   PAUSE, SLEEP, MEMBAR #sync, MEMBAR #halt,
    //   wr %g0,%asi, gethrtime, rdstick, rdtick, rdtsc, etc. ...
    // Note that on Niagara-class systems we want to minimize STs in the
    // spin loop. N1 and brethren write-around the L1$ over the xbar into the L2$.
    // Furthermore, they don't have a W$ like traditional SPARC processors.
    // We currently use a Marsaglia Shift-Xor RNG loop.
    if (Self != NULL) {
      jint rv = Self->rng[0];
      for (int k = Delay; --k >= 0;) {
        rv = MarsagliaXORV(rv);
        if (SafepointMechanism::should_block(Self)) return 0;
      }
      Self->rng[0] = rv;
    } else {
      Stall(Delay);
    }
  }
}
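To make the back-off schedule concrete, here is a minimal sketch that replays only the Probes/Delay bookkeeping from TrySpin, assuming the lock stays held for the entire spin. The names delay_schedule and total_stall are hypothetical helpers of mine, not HotSpot functions, and the real loop also pauses and polls for safepoints, which this ignores.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: returns the Delay value that would be used after each
// failed probe, assuming the lock is never observed free. The update rule and
// the 0x7FF clamp are copied from TrySpin; everything else is illustrative.
std::vector<int> delay_schedule(int spin_max = 20) {
    std::vector<int> delays;
    int probes = 0;
    int delay = 0;
    for (;;) {
        ++probes;
        if (probes > spin_max) break;            // TrySpin gives up here
        if ((probes & 0x7) == 0) {               // every 8th failed probe
            delay = ((delay << 1) | 1) & 0x7FF;  // 0 -> 1 -> 3 -> 7 ... capped at 2047
        }
        delays.push_back(delay);
    }
    return delays;
}

// Hypothetical helper: total number of stall iterations over the whole spin.
int total_stall(const std::vector<int>& delays) {
    return std::accumulate(delays.begin(), delays.end(), 0);
}
```

With SpinMax = 20 as in the code above, Delay is 0 for probes 1 to 7, 1 for probes 8 to 15, and 3 for probes 16 to 20, so under these assumptions the 0x7FF clamp is never reached and the whole spin stalls for 23 iterations in total.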
where Atomic::cmpxchg is implemented on x86 as
template<>
template<typename T>
inline T Atomic::PlatformCmpxchg<8>::operator()(T exchange_value,
                                                T volatile* dest,
                                                T compare_value,
                                                atomic_memory_order /* order */) const {
  STATIC_ASSERT(8 == sizeof(T));
  __asm__ __volatile__ ("lock cmpxchgq %1,(%3)"
                        : "=a" (exchange_value)
                        : "r" (exchange_value), "a" (compare_value), "r" (dest)
                        : "cc", "memory");
  return exchange_value;
}
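For readers who don't want to parse the inline assembly, here is a rough portable analogue using std::atomic. This is only a sketch: cmpxchg_sketch, try_acquire_sketch and kLockBit are hypothetical names, and the value 1 for the lock bit is my assumption standing in for _LBIT. It mirrors the argument order and the "return the previously observed value" convention of Atomic::cmpxchg, plus the test-before-CAS step from TrySpin.

```cpp
#include <atomic>
#include <cstdint>

// Assumed stand-in for _LBIT; the actual HotSpot constant may differ.
constexpr std::intptr_t kLockBit = 1;

// Hypothetical helper: returns the value seen in *dest before the operation,
// like Atomic::cmpxchg(exchange_value, dest, compare_value).
inline std::intptr_t cmpxchg_sketch(std::intptr_t exchange_value,
                                    std::atomic<std::intptr_t>* dest,
                                    std::intptr_t compare_value) {
    std::intptr_t observed = compare_value;
    // On failure, compare_exchange_strong stores the current value in 'observed'.
    // On success, 'observed' still equals compare_value, which was the old value.
    dest->compare_exchange_strong(observed, exchange_value,
                                  std::memory_order_seq_cst);
    return observed;
}

// Hypothetical helper: one test-and-test-and-set attempt.
inline bool try_acquire_sketch(std::atomic<std::intptr_t>* word) {
    std::intptr_t v = word->load(std::memory_order_relaxed);  // plain read first
    if ((v & kLockBit) != 0) return false;                    // held: skip the CAS
    return cmpxchg_sketch(v | kLockBit, word, v) == v;        // locked RMW only if it looked free
}
```

On x86-64, compilers typically lower compare_exchange_strong on an 8-byte value to a lock cmpxchg instruction, so the locked operation is only issued when the plain load saw the lock bit clear.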
What I don't understand is the reason for the back-off on "older SMP" systems. The comments say:
The back-off is critical for older SMP systems where constant fetching of the LockWord would otherwise impair scalability.
The only reason I can think of is that on older SMP systems, fetching and then CASing the LockWord always asserts a bus lock (rather than a cache lock). As the Intel Manual, Vol. 3, Section 8.1.4 says:
For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor. For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus.
Is that the actual reason, or is it something else?