
I'm trying to implement a high-performance blocking queue backed by a circular buffer on top of pthreads, semaphore.h and GCC atomic builtins. The queue needs to handle multiple simultaneous readers and writers from different threads.

I've isolated some sort of race condition, and I'm not sure if it's a faulty assumption about the behavior of some of the atomic operations and semaphores, or whether my design is fundamentally flawed.

I've extracted and simplified it into the standalone example below. I would expect this program never to return. It does, however, return after a few hundred thousand iterations with corruption detected in the queue.

In the example below (for exposition) the queue doesn't actually store anything: setting a cell to 1 represents that it holds data, and 0 represents an empty cell. There is a counting semaphore (vacancies) representing the number of vacant cells, and another counting semaphore (occupants) representing the number of occupied cells.

Writers do the following:

  1. decrement vacancies
  2. atomically get next head index (mod queue size)
  3. write to it
  4. increment occupants

Readers do the opposite:

  1. decrement occupants
  2. atomically get next tail index (mod queue size)
  3. read from it
  4. increment vacancies

I would expect that given the above, precisely one thread can be reading or writing any given cell at one time.

Any ideas about why it doesn't work, or debugging strategies, would be appreciated. Code and output below...

#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>
#include <iostream>

using namespace std;

#define QUEUE_CAPACITY 8 // must be power of 2
#define NUM_THREADS 2

struct CountingSemaphore
{
    sem_t m;
    CountingSemaphore(unsigned int initial) { sem_init(&m, 0, initial); }
    void post() { sem_post(&m); }
    void wait() { sem_wait(&m); }
    ~CountingSemaphore() { sem_destroy(&m); }
};

struct BlockingQueue
{
    unsigned int head; // (head % capacity) is next head position
    unsigned int tail; // (tail % capacity) is next tail position
    CountingSemaphore vacancies; // how many cells are vacant
    CountingSemaphore occupants; // how many cells are occupied

    int cell[QUEUE_CAPACITY];
// (cell[x] == 1) means occupied
// (cell[x] == 0) means vacant

    BlockingQueue() :
        head(0),
        tail(0),
        vacancies(QUEUE_CAPACITY),
        occupants(0)
    {
        for (size_t i = 0; i < QUEUE_CAPACITY; i++)
            cell[i] = 0;
    }

    // put an item in the queue
    void put()
    {
        vacancies.wait();

        // atomic post increment
        set(__sync_fetch_and_add(&head, 1) % QUEUE_CAPACITY);

        occupants.post();
    }

    // take an item from the queue
    void take()
    {
        occupants.wait();

        // atomic post increment
        get(__sync_fetch_and_add(&tail, 1) % QUEUE_CAPACITY);

        vacancies.post();
    }

    // set cell i
    void set(unsigned int i)
    {
        // atomic compare and assign
        if (!__sync_bool_compare_and_swap(&cell[i], 0, 1))
        {
            corrupt("set", i);
            exit(-1);
        }
    }

    // get cell i
    void get(unsigned int i)
    {
        // atomic compare and assign
        if (!__sync_bool_compare_and_swap(&cell[i], 1, 0))
        {
            corrupt("get", i);
            exit(-1);
        }
    }

    // corruption detected
    void corrupt(const char* action, unsigned int i)
    {
        static CountingSemaphore sem(1);
        sem.wait();

        cerr << "corruption detected" << endl;
        cerr << "action = " << action << endl;
        cerr << "i = " << i << endl;
        cerr << "head = " << head << endl;
        cerr << "tail = " << tail << endl;

        for (unsigned int j = 0; j < QUEUE_CAPACITY; j++)
            cerr << "cell[" << j << "] = " << cell[j] << endl;
    }
};

BlockingQueue q;

// keep posting to the queue forever
void* Source(void*)
{
    while (true)
        q.put();

    return 0;
}

// keep taking from the queue forever
void* Sink(void*)
{
    while (true)
        q.take();

    return 0;
} 

int main()
{
    pthread_t id;

    // start some pthreads to run Source function
    for (int i = 0; i < NUM_THREADS; i++)
        if (pthread_create(&id, NULL, &Source, 0))
            abort();

    // start some pthreads to run Sink function
    for (int i = 0; i < NUM_THREADS; i++)
        if (pthread_create(&id, NULL, &Sink, 0))
            abort();

    while (true);
}

Compile the above as follows:

    $ g++ -pthread AboveCode.cpp
    $ ./a.out

The output is different every time, but here is one example:

    corruption detected
    action = get
    i = 6
    head = 122685
    tail = 122685
    cell[0] = 0
    cell[1] = 0
    cell[2] = 1
    cell[3] = 0
    cell[4] = 1
    cell[5] = 0
    cell[6] = 1
    cell[7] = 1

My system is Ubuntu 11.10 on Intel Core 2:

    $ uname -a
    Linux 3.0.0-14-generic #23-Ubuntu SMP \
      Mon Nov 21 20:28:43 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
    $ cat /proc/cpuinfo | grep Intel
    model name : Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz
    $ g++ --version
    g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1

Thanks, Andrew.

Andrew Tomazos
  • Unfortunately it's hard to understand what's going wrong here... any reason you are not using an already existing and debugged version? For example, performance-wise, you might have false sharing issues with your version on multi-core systems. – Matthieu M. Jan 05 '12 at 07:47
  • By false sharing issues you mean the implicit memory barriers in the atomic operations? Forcing the Lx caches to be written back to main memory etc. I'm not sure how that can be avoided? Two or more of the readers/writers may be running on different threads on different CPUs so how else do you propose to synchronize them? – Andrew Tomazos Jan 05 '12 at 08:08
  • Not really. False sharing is about cache contentions issue when two different cores access different variables... that just happen to fall in the same cache line. Since exclusive ownership (required for writing) is negotiated on a cache line basis, the two cores must serialize their operations even though they are accessing and modifying two semantically distinct variables. It may really hurt performance... and existing optimized implementations already solved the issue :) – Matthieu M. Jan 05 '12 at 08:18
  • Which existing implementations solve this issue and how? Can you provide a file/line reference as an example. – Andrew Tomazos Jan 05 '12 at 08:24

3 Answers


One possible situation, traced step by step for two writer threads (W0, W1) and one reader thread (R0). W0 entered put() earlier than W1, was interrupted by the OS or hardware, and finished later.

        w0 (core 0)               w1 (core 1)                r0
t0         ----                      ---       blocked on occupants.wait() / take
t1      entered put()                ---                    ---         
t2      vacancies.wait()           entered put()            ---
t3      got new_head = 1           vacancies.wait()         ---
t4     <interrupted by OS>         got new_head = 2         ---
t5                                 written 1 at cell[2]     ---
t6                                 occupants.post();        ---
t7                                 exited put()             woke up
t8                                   ---               got new_tail = 1
t9     <still in interrupt>          ---    read 0 from cell[1]  !! corruption !!
t10     written 1 at cell[1]                           
t11     occupants.post();
t12     exited put()
alexander
  • Yeah this is it, thanks a lot. Just because writers are ordered on entry by an atomically incremented sequence number does not mean they will post to readers in the same order. I need a second sequence number per side (i.e. head_start, head_end, tail_start, tail_end), and should only increment *_end and post the semaphore once the read/write operation is completed up to that "low water mark". – Andrew Tomazos Jan 05 '12 at 20:20
  • Glad to hear it. Good luck with your experiments! – alexander Jan 05 '12 at 21:05

From a design point of view, I would consider the whole queue as a shared resource and protect it with a single mutex.

Writers do the following:

  1. take the mutex
  2. write to the queue (including handling of indexes)
  3. free the mutex

Readers do the following:

  1. take the mutex
  2. read from the queue (including handling of indexes)
  3. free the mutex
mouviciel
  • This would provide unnecessary serialization of readers and writers. Suppose the queue is empty and QUEUE_CAPACITY writers come along from QUEUE_CAPACITY different threads. Ideally you would like the QUEUE_CAPACITY constructors running in parallel on the QUEUE_CAPACITY different items of the buffer. With your design only one constructor will be able to run at a time as they wait for the mutex. – Andrew Tomazos Jan 05 '12 at 09:30
  • While this approach will introduce serialization it will also be much easier to achieve correctness. – Bowie Owens Jan 05 '12 at 11:44

I have a theory. It's a circular queue so one reading thread may be getting lapped. Say a reader takes index 0. Before it does anything it loses the CPU. Another reader thread takes index 1, then 2, then 3 ... then 7, then 0. The first reader wakes up and both threads think they have exclusive access to index 0. Not sure how to prove it. Hope that helps.

Bowie Owens
  • Readers (and writers) need to wait on a counting semaphore before they can enter the body of the function. Say the queue was full: occupants = 8 and vacancies = 0. The reader on cell 0 gets access, decrementing occupants to 7; the reader on cell 1 gets access, decrementing occupants to 6; ...; the reader on cell 7 gets access, decrementing occupants to 0. The second reader on cell 0 blocks because there are no permits left. He will block until a reader exits, incrementing vacancies to 1, and then a writer comes in, decrementing vacancies, writing, and then incrementing occupants. – Andrew Tomazos Jan 05 '12 at 08:38
  • AARGHH... and I've figured it out: what if, instead of reader 0 exiting first and posting the vacancy, it's reader 4 (say)? Then the writer will come in and overwrite cell 0. That's the problem; now what's the solution, though. :) – Andrew Tomazos Jan 05 '12 at 08:40
  • But the queue doesn't have to be full to lap. It only needs to have one entry at a time. – Bowie Owens Jan 05 '12 at 08:51
  • occupants + vacancies <= capacity at all times, so it's impossible to lap. Say the queue has 5 items in it and no readers/writers; then occupants = 5 and vacancies = 3. Four writers come along: 3 get access and the fourth blocks. The logic is sound; it's just a faulty assumption that because readers (and writers) enter in the correct order, they also exit in the same order. – Andrew Tomazos Jan 05 '12 at 08:58
  • When I say lap I mean that a circular queue with capacity n returns to index 0 after every n inserts ie head % n == 0 for head = 0, head = n, head = 2 * n and so on. I wasn't talking about violating the capacity. We are describing the same problem though. – Bowie Owens Jan 05 '12 at 10:43
  • I've not yet seen any solution anywhere to achieving safety in the queue except by locking up the entire queue, in order to prevent entry by multiple readers/writers, with a mutex/futex as suggested in @BowieOwens' posts. I'm prepared to be surprised, but I'm not holding my breath. In any case, I suspect that the two kernel calls for the semaphores will take substantially more time than a futex lock, defeating the purpose of the lock-free optimization, even if it can be made to work. – Martin James Jan 05 '12 at 16:13
  • If your aim is to improve performance with multiple producers and consumers, there are some other optimizations that can be made. It is possible to avoid most kernel calls in the cases where a producer arrives and the queue is not full and when a consumer arrives and the queue is not empty. Doing this requires a futex/CS type lock where a short spin is backed up by a kernel lock - you can examine the queue indexes inside the futex to determine whether the queue is empty or full. I've implemented a blocking queue using this, but sadly, only an unbounded one, (though it did have timeouts!). – Martin James Jan 05 '12 at 16:32
  • So a semaphore (sem_wait, sem_post) is implemented with an atomic CAS operation (CMPXCHG on x86) that stays in userland (faster than syscall to futex) in the non-contended case (permits > 0) and is implemented with a futex in the blocking case. So basically what you describe is exactly how sem_wait/sem_post work already. See sem_post.S and sem_wait.S in glibc archive directory "nptl/sysdeps/unix/sysv/linux/x86_64/". – Andrew Tomazos Jan 05 '12 at 20:15