
I am quite new to multi-threading. I have a single-threaded data analysis app with a good bit of potential for parallelization, and while the data sets are large, it does not come close to saturating the hard-disk read/write, so I figure I should take advantage of the threading support that is now in the standard and try to speed the beast up.

After some research I decided that producer-consumer was a good approach for reading data from disk and processing it, and I started writing an object pool that would become part of the circular buffer where the producers put data and the consumers get it. As I was writing the class it felt like I was being too fine-grained in how I was handling locking and releasing data members. It feels like half the code is locking and unlocking, and like there is an insane number of synchronization objects floating around.
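(Just to illustrate the structure I mean - this is a rough sketch of a bounded producer-consumer buffer with placeholder names, not the actual code from the app.)

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Rough sketch: producers (disk readers) call put(), consumers (processors) call get().
template <typename Block>
class BoundedBuffer
{
public:
    explicit BoundedBuffer(std::size_t capacity) : buffer(capacity) {}

    void put(const Block& b)
    {
        std::unique_lock<std::mutex> lock(m);
        notFull.wait(lock, [this]{ return count < buffer.size(); });
        buffer[(head + count) % buffer.size()] = b;
        ++count;
        notEmpty.notify_one();
    }

    Block get()
    {
        std::unique_lock<std::mutex> lock(m);
        notEmpty.wait(lock, [this]{ return count > 0; });
        Block b = buffer[head];
        head = (head + 1) % buffer.size();
        --count;
        notFull.notify_one();
        return b;
    }

private:
    std::vector<Block> buffer;
    std::size_t head = 0, count = 0;
    std::mutex m;
    std::condition_variable notEmpty, notFull;
};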

So I come to you with a class declaration and a sample function, and this question: Is this too fine-grained? Not fine-grained enough? Poorly thought out?

struct PoolArray
{
public:
    Obj* arr;
    uint32 used;
    uint32 refs;
    std::mutex locker;
};

class SegmentedPool
{
public: /*Construction and destruction cut out*/
    void alloc(uint32 cellsNeeded, PoolPtr& ptr);
    void dealloc(PoolPtr& ptr);
    void clearAll();
private:
    void expand();

    //stores all the segments of the pool
    std::vector< PoolArray<Obj> > pools;
    ReadWriteLock poolLock;

    //stores pools that are empty
    std::queue< int > freePools;
    std::mutex freeLock;

    int currentPool;
    ReadWriteLock currentLock;
};

void SegmentedPool::dealloc(PoolPtr& ptr)
{
    //find and access the segment
    poolLock.lockForRead();
    PoolArray* temp = &(pools[ptr.getSeg()]);
    poolLock.unlockForRead();
    //reduce the count of references in the segment
    temp->locker.lock();
    --(temp->refs);
    //if the number of references is now zero then set the segment back to unused
    //and push it onto the queue of empty segments so that it can be reused
    if(temp->refs==0)
    {
        temp->used=0;
        freeLock.lock();
        freePools.push(ptr.getSeg());
        freeLock.unlock();
    }
    temp->locker.unlock();
    ptr.set(NULL,-1);
}

A few explanations: First, PoolPtr is a stupid little pointer-like object that stores the pointer and the number of the pool segment the pointer came from.

Second, this is all "templatized", but I took those lines out to try to reduce the length of the code block.

Third, ReadWriteLock is something I put together using a mutex and a pair of condition variables.
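Roughly, the idea is along these lines (a simplified sketch, not the exact code - the method names match what dealloc() uses, but the internals here are approximate):

#include <condition_variable>
#include <mutex>

class ReadWriteLock
{
public:
    void lockForRead()
    {
        std::unique_lock<std::mutex> lock(m);
        readOk.wait(lock, [this]{ return !writing; });
        ++readers;
    }
    void unlockForRead()
    {
        std::unique_lock<std::mutex> lock(m);
        if(--readers == 0)
            writeOk.notify_one();
    }
    void lockForWrite()
    {
        std::unique_lock<std::mutex> lock(m);
        writeOk.wait(lock, [this]{ return !writing && readers == 0; });
        writing = true;
    }
    void unlockForWrite()
    {
        std::unique_lock<std::mutex> lock(m);
        writing = false;
        writeOk.notify_one();
        readOk.notify_all();
    }
private:
    std::mutex m;
    std::condition_variable readOk, writeOk;
    int readers = 0;
    bool writing = false;
};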


2 Answers


Locks are inefficient no matter how fine-grained they are, so avoid them at all costs.

Both the queue and the vector can be easily implemented lock-free using the compare-and-swap primitive.

There are a number of papers on the topic:

Lock free queue:

Lock free vector:

Stroustrup's paper also refers to a lock-free allocator, but don't jump at it right away; standard allocators are pretty good these days.
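To give a flavour of what compare-and-swap buys you, here is a minimal sketch of a lock-free (Treiber-style) stack push using std::atomic - just an illustration of the primitive, not a production container:

#include <atomic>

template <typename T>
class LockFreeStack
{
public:
    void push(const T& value)
    {
        Node* node = new Node{value, head.load(std::memory_order_relaxed)};
        // Keep retrying until the head we saw is still the head at the moment we swap.
        // On failure, compare_exchange_weak reloads the current head into node->next.
        while(!head.compare_exchange_weak(node->next, node,
                                          std::memory_order_release,
                                          std::memory_order_relaxed))
        {
        }
    }
private:
    struct Node
    {
        T value;
        Node* next;
    };
    std::atomic<Node*> head{nullptr};
};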

UPD: If you don't want to bother writing your own containers, use Intel's Threading Building Blocks library; it provides both a thread-safe vector and a queue. They are NOT lock-free, but they are optimized to use the CPU cache efficiently.
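For illustration, usage is roughly like this (WorkItem is just a placeholder type):

#include <tbb/concurrent_queue.h>
#include <tbb/concurrent_vector.h>

struct WorkItem { int id; };

tbb::concurrent_queue<WorkItem> workQueue; // safe to push/try_pop from any thread
tbb::concurrent_vector<double> results;    // safe to push_back from any thread

void producer()
{
    WorkItem item = { 42 };
    workQueue.push(item);
}

void consumer()
{
    WorkItem item;
    if(workQueue.try_pop(item))        // non-blocking; returns false if the queue is empty
        results.push_back(item.id * 2.0);
}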

UPD: Regarding PoolArray, you don't need a lock there either. If you can use C++11, use std::atomic for atomic increments and swaps; otherwise use compiler built-ins (the Interlocked* functions in MSVC and the __sync* builtins in GCC: http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html).
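For the refs counter, that could look something like this (releaseRef and the memory orders are my own illustration, not code from the question):

#include <atomic>
#include <cstdint>

struct Obj; // the pooled element type from the question

struct PoolArray
{
    Obj* arr = nullptr;
    std::atomic<uint32_t> used{0};
    std::atomic<uint32_t> refs{0};
};

// fetch_sub returns the previous value, so the thread that sees 1 is the one
// that dropped the last reference and may recycle the segment.
void releaseRef(PoolArray& segment)
{
    if(segment.refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
    {
        segment.used.store(0, std::memory_order_relaxed);
        // push the segment index onto the free queue here (that part still
        // needs its own synchronisation, or a concurrent queue)
    }
}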

  • Right now I am trying to keep things simple with regard to the additions needed to support the multithreading. I guess I have to ask: is there anything I can do to avoid having to implement my own vector and queue while still being more efficient in locking and unlocking? – James Matta Dec 06 '12 at 08:59
  • @JamesMatta, I updated my answer. In short: use the Threading Building Blocks library –  Dec 06 '12 at 09:06
  • You don't have to implement it for yourself. There are enough lockfree data structures out there already, you just have to use them ;) This is even easier than to lock the access to your `std::vector`s and `std::queue`s, because you don't have to wrap your mind around all the corner cases to figure out how the locking has to be done - and you won't have to find the one place that you did not think of and that's causing sporadic race conditions. Bottom line: if you have to use data structures concurrently, reuse existing data structures designed for concurrency. Don't reinvent the wheel. – Arne Mertz Dec 06 '12 at 09:11
  • @ArneMertz, whilst I agree with you, I couldn't find any production ready lock free queue and vector implementations (I needed them for my project). –  Dec 06 '12 at 09:14
  • @Aleguna, Hmmm I like that it is open source, I might have to try it, and then if there are too many user complaints about the number of external libraries they have to use (I already use Qt and MPIR) then I might look into rolling my own. – James Matta Dec 06 '12 at 09:15
  • Also, do you have any comments on the locking and unlocking for the individual segments? Or is that unavoidable? Looking at it, that pretty much reduces locking and unlocking to once per function, but I am wondering about the overhead of having a few hundred mutexes floating around. – James Matta Dec 06 '12 at 09:17
  • Your fine-grained approach looks OK, and several hundred mutexes shouldn't be a problem in itself. It's the actual locking that is potentially expensive (switching to kernel mode). Besides, it could cause all sorts of horrible problems - races, deadlocks, etc. Also see my update on `PoolArray` –  Dec 06 '12 at 09:24
  • @aleguna: Thank you very much for both of your updates; they helped a lot. I am going to look at both the TBB library and the papers. I already sort of knew that locks were expensive, so it might just be worth it to roll my own, for the experience, the speed, and having those classes available in the future. Also, I had forgotten about the atomic integer stuff; I will have to think about whether the atomic operations are sufficient to keep things sane while allocating onto segments. – James Matta Dec 06 '12 at 09:43
  • Perhaps, the [Lock-Free Single-Producer - Single Consumer Circular Queue](http://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular) implementation might be of help – Dmitry Ledentsov Dec 06 '12 at 09:49
  • @DmitryLedentsov, single-producer/single-consumer is not very useful in most cases, unfortunately –  Dec 06 '12 at 09:53
  • @aleguna, sure, but that doesn't mean you can't use it in a more complex scenario. The easiest way is to have multiple queues for different subscriptions. If the parallelization doesn't need communication, one could use a queue per worker thread, where the thread waiting for the results would aggregate those from each of these queues – Dmitry Ledentsov Dec 06 '12 at 09:59

A good start - you lock things when needed and free them as soon as you're finished.

Your ReadWriteLock is pretty much a CCriticalSection object - depending on your needs it might improve performance to use that instead.

One thing I would say is: call temp->locker.lock() before you release the lock on the pool with poolLock.unlockForRead(); otherwise you're performing operations on the pool object when it's not under synchronisation control - it could be being used by another thread at that point. A minor point, but with multi-threading it's the minor points that trip you up in the end.
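In terms of the dealloc() from the question, that reordering would look roughly like this (the only change is that the segment is locked before the pool's read lock is released):

void SegmentedPool::dealloc(PoolPtr& ptr)
{
    //find and access the segment
    poolLock.lockForRead();
    PoolArray* temp = &(pools[ptr.getSeg()]);
    temp->locker.lock();      // lock the segment while the pool is still protected
    poolLock.unlockForRead(); // now it is safe to let go of the pool lock
    //reduce the count of references in the segment
    --(temp->refs);
    if(temp->refs==0)
    {
        temp->used=0;
        freeLock.lock();
        freePools.push(ptr.getSeg());
        freeLock.unlock();
    }
    temp->locker.unlock();
    ptr.set(NULL,-1);
}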

A good approach to take to multi-threading is to wrap any controlled resources in objects or functions that do the locking and unlocking inside them, so anyone who wants access to the data doesn't have to worry about which lock to lock or unlock, and when to do it. For example:

  ...
  if(temp->refs==0)
  {
    temp->used=0;
    freeLock.lock();
    freePools.push(ptr.getSeg());
    freeLock.unlock();
  }
  ...

would be...

  ...
  if(temp->refs==0)
  {
    temp->used=0;
    addFreePool(ptr.getSeg());
  }
  ...

void SegmentedPool::addFreePool(unsigned int seg)
{
  freeLock.lock();
  freePools.push(seg);
  freeLock.unlock();
}
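
As an aside, since std::mutex is already in use, a std::lock_guard would do the unlocking automatically (and keep the mutex from staying locked if push() throws) - the same idea, just expressed with RAII:

#include <mutex>

void SegmentedPool::addFreePool(unsigned int seg)
{
    std::lock_guard<std::mutex> guard(freeLock); // unlocks on scope exit
    freePools.push(seg);
}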

There are plenty of multi-threading benchmarking tools out there too. You can play around with controlling your resources in different ways, run it through one of the tools, and see where any bottlenecks are if you feel like performance is becoming an issue.

  • The advice about unlocking the pool segment has now been implemented, thanks for that. Unfortunately, a bit of googling reveals that critical sections are a Microsoft thing, and this code needs to run on both Linux and Windows, which sort of puts a damper on that. Also, thanks for the advice about wrapping the controlled resources. – James Matta Dec 06 '12 at 09:33