
I'm looking at the spin lock implementation in the HotSpot JVM from OpenJDK 12. Here is how it is implemented (comments preserved):

// Polite TATAS spinlock with exponential backoff - bounded spin.
// Ideally we'd use processor cycles, time or vtime to control
// the loop, but we currently use iterations.
// All the constants within were derived empirically but work over
// over the spectrum of J2SE reference platforms.
// On Niagara-class systems the back-off is unnecessary but
// is relatively harmless.  (At worst it'll slightly retard
// acquisition times).  The back-off is critical for older SMP systems
// where constant fetching of the LockWord would otherwise impair
// scalability.
//
// Clamp spinning at approximately 1/2 of a context-switch round-trip.
// See synchronizer.cpp for details and rationale.

int Monitor::TrySpin(Thread * const Self) {
  if (TryLock())    return 1;
  if (!os::is_MP()) return 0;

  int Probes  = 0;
  int Delay   = 0;
  int SpinMax = 20;
  for (;;) {
    intptr_t v = _LockWord.FullWord;
    if ((v & _LBIT) == 0) {
      if (Atomic::cmpxchg (v|_LBIT, &_LockWord.FullWord, v) == v) {
        return 1;
      }
      continue;
    }

    SpinPause();

    // Periodically increase Delay -- variable Delay form
    // conceptually: delay *= 1 + 1/Exponent
    ++Probes;
    if (Probes > SpinMax) return 0;

    if ((Probes & 0x7) == 0) {
      Delay = ((Delay << 1)|1) & 0x7FF;
      // CONSIDER: Delay += 1 + (Delay/4); Delay &= 0x7FF ;
    }

    // Stall for "Delay" time units - iterations in the current implementation.
    // Avoid generating coherency traffic while stalled.
    // Possible ways to delay:
    //   PAUSE, SLEEP, MEMBAR #sync, MEMBAR #halt,
    //   wr %g0,%asi, gethrtime, rdstick, rdtick, rdtsc, etc. ...
    // Note that on Niagara-class systems we want to minimize STs in the
    // spin loop.  N1 and brethren write-around the L1$ over the xbar into the L2$.
    // Furthermore, they don't have a W$ like traditional SPARC processors.
    // We currently use a Marsaglia Shift-Xor RNG loop.
    if (Self != NULL) {
      jint rv = Self->rng[0];
      for (int k = Delay; --k >= 0;) {
        rv = MarsagliaXORV(rv);
        if (SafepointMechanism::should_block(Self)) return 0;
      }
      Self->rng[0] = rv;
    } else {
      Stall(Delay);
    }
  }
}

Link to source
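As a side note on the back-off schedule: the `Delay = ((Delay << 1) | 1) & 0x7FF` update walks through 1, 3, 7, 15, … and saturates at 0x7FF (2047). A standalone sketch (plain C++, with a helper name of my own, nothing HotSpot-specific) reproduces the sequence:

```cpp
#include <vector>

// Reproduce HotSpot's back-off schedule: Delay = ((Delay << 1) | 1) & 0x7FF.
// Starting from 0 this yields 1, 3, 7, 15, ... and saturates at 0x7FF.
std::vector<int> backoff_schedule(int steps) {
    std::vector<int> out;
    int delay = 0;
    for (int i = 0; i < steps; ++i) {
        delay = ((delay << 1) | 1) & 0x7FF;
        out.push_back(delay);
    }
    return out;
}
```

Note that with `SpinMax = 20` and an update only on every eighth probe, `TrySpin` performs at most two updates (at probes 8 and 16), so only the delays 0, 1, and 3 are ever used before it gives up.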

where Atomic::cmpxchg is implemented on x86-64 as follows:

template<>
template<typename T>
inline T Atomic::PlatformCmpxchg<8>::operator()(T exchange_value,
                                                T volatile* dest,
                                                T compare_value,
                                                atomic_memory_order /* order */) const {
  STATIC_ASSERT(8 == sizeof(T));
  __asm__ __volatile__ ("lock cmpxchgq %1,(%3)"
                        : "=a" (exchange_value)
                        : "r" (exchange_value), "a" (compare_value), "r" (dest)
                        : "cc", "memory");
  return exchange_value;
}

Link to source
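To make the overall pattern explicit: this is a "test, then test-and-set" (TATAS) lock, where waiters spin read-only and attempt the atomic operation only once the lock word looks free. A minimal portable sketch of that shape, with the same bounded back-off (my own simplification, not HotSpot code), looks like this:

```cpp
#include <atomic>

// Minimal TATAS try-lock with bounded exponential back-off, mirroring the
// shape of TrySpin above. Waiters spin on a plain load, so they mostly
// issue MESI share requests instead of a stream of invalidations.
class TatasLock {
    std::atomic<bool> locked_{false};
public:
    bool try_spin(int spin_max = 20) {
        int delay = 0;
        for (int probes = 0; probes < spin_max; ++probes) {
            if (!locked_.load(std::memory_order_relaxed) &&            // test
                !locked_.exchange(true, std::memory_order_acquire)) {  // test-and-set
                return true;                        // acquired
            }
            if ((probes & 0x7) == 0x7)              // every eighth probe
                delay = ((delay << 1) | 1) & 0x7FF; // 1, 3, 7, ... capped at 2047
            for (volatile int k = delay; k > 0; --k) {
                // stall without touching the lock word (cf. SpinPause())
            }
        }
        return false;   // bounded spin exhausted; caller should block instead
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};
```

The read-only spin is the key point for the question below: a waiter that merely loads the lock word keeps the line in Shared state, whereas each CAS attempt would demand it Exclusive.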

The thing I don't understand is the reason for the back-off on "older SMP" systems. The comments say:

The back-off is critical for older SMP systems where constant fetching of the LockWord would otherwise impair scalability.

The only reason I can imagine is that on older SMP systems, fetching and then CASing the LockWord always asserts a bus lock (rather than a cache lock). As the Intel Manual, Vol. 3, §8.1.4 says:

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor. For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus.

Is that the actual reason, or is it something else?

  • I don't think they're talking about x86 at all, and almost certainly not primitive x86 before PPro that can just take a cache-lock. "Niagara-class systems" = [massively multi-threaded UltraSPARC T1](https://en.wikipedia.org/wiki/UltraSPARC_T1). For "earlier SMP" they might mean earlier SPARC, or earlier systems in general with normal MESI that can use a cache-lock for atomic CAS attempts. Also note that they spin read-only, only even attempting to CAS if they see the lock unlocked. I'm not sure exactly why backoff is still good in this case, or on which systems. – Peter Cordes Oct 29 '19 at 07:01
  • Probably read attempts (leading to MESI share requests) can disrupt an LL/SC CAS attempt on earlier SPARC CPUs, potentially leading to livelock if a core can't hold onto a line in Exclusive state between LL and SC. Or at least slowdowns. – Peter Cordes Oct 29 '19 at 07:32
  • @PeterCordes _Probably read attempts (leading to MESI share requests) can disrupt an LL/SC CAS attempt on earlier SPARC CPUs_ Sounds like a reasonable idea. At least an LL/SC pair can fail spuriously, unlike the CAS implementation on x86, which AFAIK can never fail for all competing threads (at least one succeeds). – St.Antario Oct 29 '19 at 07:39
  • Yes, exactly, creating spurious failure is what I meant. And yes, that's correct; x86 `lock cmpxchg` implements C++11 `compare_exchange_strong` - spurious failure impossible. – Peter Cordes Oct 29 '19 at 07:39
  • This [link](http://pdos.csail.mit.edu/papers/linux:lock.pdf) should shed some light. It has to do with the number of MESI share requests generated after a release: the time needed to satisfy one such request is linear in the number of waiting threads. By backing off we reduce the number of requests a single core needs to wait for. I'm not sure why old SMP systems were more affected by this than modern ones: probably a more decentralised architecture (i.e. the links between components, like the ring lanes, QPI lanes or equivalent). – Margaret Bloom Oct 29 '19 at 08:35
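The strong/weak CAS distinction raised in the last two comments can be demonstrated in portable C++. A single-threaded sketch (the function name is mine):

```cpp
#include <atomic>

// compare_exchange_strong, like x86 `lock cmpxchg`, cannot fail spuriously:
// if it returns false, `expected` has been refreshed with the observed value.
// compare_exchange_weak may fail even when the values match (relevant on
// LL/SC machines such as SPARC or ARM), so it belongs in a retry loop.
int cas_demo() {
    std::atomic<int> word{0};
    int expected = 0;
    word.compare_exchange_strong(expected, 1);   // 0 -> 1, succeeds
    expected = 1;
    while (!word.compare_exchange_weak(expected, 2)) {
        // a spurious failure leaves expected == the current value; retry
    }
    return word.load();
}
```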

0 Answers