I found the following spinlock code in boost::smart_ptr:
bool try_lock()
{
    return (__sync_lock_test_and_set(&v_, 1) == 0);
}

void lock()
{
    for (unsigned k = 0; !try_lock(); ++k)
    {
        if (k < 4)
            ; // spin
        else if (k < 16)
            __asm__ __volatile__("pause"); // was ("rep; nop" ::: "memory")
        else if (k < 32 || k & 1)
            sched_yield();
        else
        {
            struct timespec rqtp;
            rqtp.tv_sec = 0;
            rqtp.tv_nsec = 100;
            nanosleep(&rqtp, 0);
        }
    }
}

void unlock()
{
    __sync_lock_release(&v_);
}
So if I understand this correctly, when the lock is contended the incoming thread backs off in stages: first spinning busily, then issuing pause instructions, then yielding the remainder of its time slice, and finally alternating between sleeping for ~100 ns and yielding.
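To make sure I'm reading it right, here is a self-contained version of that snippet (the class wrapper, the int v_ member and its zero-initialization, and the headers are my own scaffolding; only the three member functions come from Boost):

#include <sched.h> // sched_yield
#include <time.h>  // nanosleep, struct timespec

class spinlock
{
public:
    spinlock() : v_(0) {} // 0 = unlocked, 1 = locked

    bool try_lock()
    {
        // __sync_lock_test_and_set returns the previous value,
        // so 0 means the lock was free and is now ours.
        return __sync_lock_test_and_set(&v_, 1) == 0;
    }

    void lock()
    {
        for (unsigned k = 0; !try_lock(); ++k)
        {
            if (k < 4)
                ; // stage 1: busy-spin
            else if (k < 16)
                __asm__ __volatile__("pause"); // stage 2: pause
            else if (k < 32 || k & 1)
                sched_yield(); // stage 3: give up the time slice
            else
            {
                // stage 4 (even k): ask for a ~100 ns sleep; in practice
                // the kernel rounds this up to something much coarser
                struct timespec rqtp = { 0, 100 };
                nanosleep(&rqtp, 0);
            }
        }
    }

    void unlock()
    {
        __sync_lock_release(&v_); // release barrier, then store 0
    }

private:
    int v_;
};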
I also found the glibc pthread_spinlock implementation, which uses inline assembly to perform the lock.
#define LOCK_PREFIX "lock;" // using an SMP machine

int pthread_spin_lock(pthread_spinlock_t *lock)
{
    __asm__ ("\n"
             "1:\t" LOCK_PREFIX "decl %0\n\t"
             "jne 2f\n\t"
             ".subsection 1\n\t"
             ".align 16\n"
             "2:\trep; nop\n\t"
             "cmpl $0, %0\n\t"
             "jg 1b\n\t"
             "jmp 2b\n\t"
             ".previous"
             : "=m" (*lock)
             : "m" (*lock));
    return 0;
}
I will admit that my understanding of assembly is not great, so I don't fully understand what is happening here. (Could someone please explain what this is doing?)
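My best guess at a C rendering of that assembly is below; this is purely my own reading (using a __sync builtin in place of the raw asm) and may well be wrong, which is partly why I'm asking:

#include <pthread.h> // pthread_spinlock_t is a volatile int in glibc

// My (possibly wrong) C rendering of the assembly above. glibc
// initializes a free lock to 1; 0 or below means it is held.
int my_spin_lock(pthread_spinlock_t *lock)
{
    for (;;)
    {
        // "1: lock decl %0; jne 2f" -- atomically decrement the lock.
        // If the result is 0, it was 1 (free) and we now own it.
        if (__sync_sub_and_fetch(lock, 1) == 0)
            return 0;

        // "2: rep; nop / cmpl $0,%0 / jg 1b / jmp 2b" -- spin with
        // plain reads (no bus-locking writes) until the lock looks
        // free again, then jump back and retry the atomic decrement.
        while (*lock <= 0)
            __asm__ __volatile__("pause"); // "rep; nop" == pause
    }
}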
However, I ran some tests comparing the boost spinlock against the glibc pthread_spinlock, and when there are more cores than threads, the boost code outperforms the glibc code. On the other hand, when there are more threads than cores, the glibc code is better.
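For concreteness, a stripped-down version of the kind of test I ran looks like this: T threads hammering a tiny critical section, with T varied above and below the core count (the Boost-style spinlock from above gets swapped in for the other data point):

#include <pthread.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static pthread_spinlock_t lock_;
static long counter = 0;

static void worker(int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        pthread_spin_lock(&lock_);
        ++counter; // deliberately tiny critical section
        pthread_spin_unlock(&lock_);
    }
}

int main()
{
    const int T = 8;       // thread count: vary vs. core count
    const int N = 1000000; // lock acquisitions per thread

    pthread_spin_init(&lock_, PTHREAD_PROCESS_PRIVATE);

    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int t = 0; t < T; ++t)
        threads.emplace_back(worker, N);
    for (std::thread& th : threads)
        th.join();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%d threads, %ld total acquisitions: %lld ms\n",
                T, (long)T * N, (long long)ms);

    pthread_spin_destroy(&lock_);
    return 0;
}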
Why is this? What is the difference between these two spinlock implementations that makes them perform differently in each scenario?