
I found the following spinlock code in boost::smart_ptr:

bool try_lock()
{
    return (__sync_lock_test_and_set(&v_, 1) == 0);
}
void lock()
{    
    for (unsigned k=0; !try_lock(); ++k)
    {
        if (k<4)
            ; // spin
        else if (k < 16)
            __asm__ __volatile__("pause"); // was ("rep; nop" ::: "memory")
        else if (k < 32 || k & 1)
            sched_yield();
        else
        {
            struct timespec rqtp;
            rqtp.tv_sec  = 0;
            rqtp.tv_nsec = 100;
            nanosleep(&rqtp, 0);
        }
    }
}
void unlock()
{
     __sync_lock_release(&v_);
}

So if I understand this correctly, when the lock is contended the incoming thread backs off progressively: first spinning wildly, then issuing pause instructions, then yielding the remainder of its time slice, and finally alternating between sleeping and yielding.
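For completeness, here is a minimal self-contained rendering of that snippet as I understand it (the class name, the declaration of v_, and its initialization are my own reconstruction, not boost's actual code):

#include <sched.h>   // sched_yield
#include <time.h>    // nanosleep, struct timespec

class spinlock       // hypothetical name; the member functions follow the boost code above
{
    int v_;          // 0 = unlocked, 1 = locked

public:
    spinlock() : v_(0) {}

    bool try_lock()
    {
        // atomically store 1 and test whether the previous value was 0
        return __sync_lock_test_and_set(&v_, 1) == 0;
    }

    void lock()
    {
        for (unsigned k = 0; !try_lock(); ++k)
        {
            if (k < 4)
                ; // spin
            else if (k < 16)
                __asm__ __volatile__("pause"); // spin-wait hint to the CPU
            else if (k < 32 || k & 1)
                sched_yield(); // give up the rest of the time slice
            else
            {
                struct timespec rqtp = { 0, 100 }; // 100 ns
                nanosleep(&rqtp, 0);
            }
        }
    }

    void unlock()
    {
        __sync_lock_release(&v_); // store 0 with release semantics
    }
};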

I also found the glibc pthread_spinlock implementation, which uses assembly to implement the lock.

#define LOCK_PREFIX "lock;" // using an SMP machine

int pthread_spin_lock(pthread_spinlock_t *lock)
{
    __asm__ ("\n"
       "1:\t" LOCK_PREFIX "decl %0\n\t"
       "jne 2f\n\t"
       ".subsection 1\n\t"
       ".align 16\n"
       "2:\trep; nop\n\t"
       "cmpl $0, %0\n\t"
       "jg 1b\n\t"
       "jmp 2b\n\t"
       ".previous"
       : "=m" (*lock)
       : "m" (*lock));

    return 0;
}

I will admit that my understanding of assembly is not great, so I don't fully understand what is happening here. (Could someone please explain what this is doing?)

However, I ran some tests against the boost spinlock and glibc pthread_spinlock, and when there are more cores than threads, the boost code outperforms the glibc code.

On the other hand, when there are more threads than cores, the glibc code is better.

Why is this? What is the difference between these two spinlock implementations that makes them perform differently in each scenario?

Steve Lorimer
  • Funny, I made a similar test a few years ago and came to the same conclusion: `pthread_spin_lock` is more efficient than manual spinlocks (like the one from boost) when there is a lot of contention. – Maxim Egorushkin Jun 20 '12 at 13:57

1 Answer


Where did you get the pthread_spin_lock() implementation posted in the question? It seems to be missing a couple of important lines.

The implementation I see (which isn't inline assembly - it's a stand-alone assembly source file from glibc/nptl/sysdeps/i386/pthread_spin_lock.S) looks similar, but has the two additional critical instructions:

#include <lowlevellock.h>

    .globl  pthread_spin_lock
    .type   pthread_spin_lock,@function
    .align  16
pthread_spin_lock:
    mov 4(%esp), %eax       # %eax = address of the lock word (the argument)
1:  LOCK
    decl    0(%eax)         # atomically decrement the lock word
    jne 2f                  # non-zero result: lock is held, go spin
    xor %eax, %eax          # zero result: we acquired the lock,
    ret                     # so return 0

    .align  16
2:  rep
    nop                     # "rep; nop" is the pause spin-wait hint
    cmpl    $0, 0(%eax)     # re-read the lock word
    jg  1b                  # positive: lock looks free, retry the decrement
    jmp 2b                  # still held: keep pausing
    .size   pthread_spin_lock,.-pthread_spin_lock

It atomically decrements the long pointed to by the passed-in parameter; if the result is zero, the lock has been acquired and the function returns 0.

Otherwise, the result was non-zero, which means this thread didn't acquire the lock. In that case it performs a rep nop, which is equivalent to the pause instruction. This is a 'special' nop that hints to the CPU that the thread is in a spin-wait, so the CPU can handle memory ordering and/or branch prediction in some manner that improves performance in that situation (I don't pretend to understand exactly what happens differently under the chip's covers - from the software's point of view, there's no difference from a plain old nop).
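You can check the equivalence yourself: the two spellings assemble to the same bytes, and compilers also expose the instruction as an intrinsic. A tiny illustration (the function name is just for demonstration):

#include <xmmintrin.h> // _mm_pause

void spin_hint()
{
    __asm__ __volatile__("rep; nop"); // encodes as F3 90...
    __asm__ __volatile__("pause");    // ...exactly the same bytes
    _mm_pause();                      // intrinsic form of the same instruction
}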

After the pause it checks the value again - if it's greater than zero, the lock is unclaimed so it jumps to the top of the function and tries to claim the lock again. Otherwise, it jumps to the pause again.
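To make the control flow concrete, here is a rough C++ rendering of the same logic (purely illustrative - not glibc's code - and it assumes that releasing the lock stores 1 back into the lock word, which is how the matching unlock behaves as far as I can tell):

// Illustrative equivalent of the assembly above - not the actual glibc code.
int spin_lock_sketch(volatile int* lock) // *lock == 1 when the lock is free
{
    for (;;)
    {
        // "LOCK decl": atomically decrement; a zero result means we got it
        if (__sync_sub_and_fetch(lock, 1) == 0)
            return 0; // "xor %eax, %eax; ret"

        // didn't get it: pause until the lock word looks free again
        do
            __asm__ __volatile__("pause"); // "rep; nop"
        while (*lock <= 0);                // "cmpl $0, (%eax); jg 1b"
    }
}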

The main difference between this spinlock and the Boost version is that this one never does anything fancier than a pause when it's spinning - there's nothing like a sched_yield() or nanosleep(), so the thread stays hot. I'm not sure precisely how this plays into the two behaviors you noted, but the glibc code is greedier: if a thread is spinning on the lock and there are other threads ready to run but no available core, the spinning thread doesn't help the waiting threads get any CPU time, while the Boost version will eventually make way voluntarily for the thread that's waiting for some attention.

Michael Burr
  • I found it somewhere online - unfortunately I can't recall exactly where. I used that pthread_spin_lock in my tests, found the results I reported, and when I saw the differences I went looking for the source code to try to understand what was happening; I found that assembly, and when I couldn't understand it, came looking for help here. Thanks for the response and explanation! – Steve Lorimer Jul 13 '12 at 04:38
  • This seems counterintuitive to me - the greedier pthread implementation, which doesn't give up its core under contention, performs better when the cores are oversubscribed (more threads than cores) - I would have expected the boost implementation to do better in that scenario. – Steve Lorimer Jul 16 '12 at 03:15
  • @lori: it's hard to say what's going on - we have no information about the benchmarks. That said, I'm not sure how often using spinlocks makes sense in user-mode code. Even if it does at times, I think they should be used only when one expects low contention and when they'd be held as briefly as possible. Performing the kinds of checks that Boost does might be a bit of overengineering. Then again, that's all just opinion based on gut feeling, not data. – Michael Burr Jul 16 '12 at 06:24