
I am trying (or rather, tried) to implement a circular buffer with the following interface:

ring_buffer *ring_buffer_create(int capacity, int element_size);
void ring_buffer_destroy(ring_buffer *buffer);

const void *ring_buffer_read_acquire(ring_buffer *buffer, ring_buffer_loc *loc);
void ring_buffer_read_finish(ring_buffer *buffer, ring_buffer_loc loc);

void *ring_buffer_write_acquire(ring_buffer *buffer, ring_buffer_loc *loc);
void ring_buffer_write_finish(ring_buffer *buffer, ring_buffer_loc loc);

It should be possible to read / write multiple elements concurrently (and even in parallel). E.g.:

ring_buffer *buffer = ring_buffer_create(10, sizeof(int));

/* Write a single element */
ring_buffer_loc loc0;
int *i0 = ring_buffer_write_acquire(buffer, &loc0);
*i0 = 42; // this could be a big data structure and way more expensive
ring_buffer_write_finish(buffer, loc0);

/* Write "concurrently" */
ring_buffer_loc loc1, loc2;
int *i1 = ring_buffer_write_acquire(buffer, &loc1);
int *i2 = ring_buffer_write_acquire(buffer, &loc2);
*i1 = 1729;
*i2 = 314;
ring_buffer_write_finish(buffer, loc1);
ring_buffer_write_finish(buffer, loc2);

All "acquire"-functions should be blocking until the operation is possible.

So far, so good. I thought this would be simple, so I started with a clean implementation based on a mutex. But I soon saw that it was far too slow for my use case (100'000 writes and reads per second), so I switched over to spin-locks etc.

My implementation became quite messy, and at some point (now) I started to wonder why something "simple" like this with the desired interface doesn't already exist. It is probably not a great idea to re-implement something like this anyway.

Maybe someone knows an implementation that has such an interface and that blocks if the operation is not possible? I searched the internet for quite a while but could not find a good match for my problem. Maybe my desired interface is just "bad" or "wrong"?

Nevertheless, here is my current code. It basically assigns each "cell" (= value slot) a state, which can be NONE (not set; the cell is basically empty), WRITING (someone acquired the cell to write data), READING (someone acquired the cell to read) and SET (the cell has a value that can be read). Each cell has a spin-lock which is used to update the cell state.

It then works like this: when someone acquires a read and the current cell has the state SET, the value can be read (the new state is READING) and the read index is increased. In all other cases a condition variable is used to wait until an element is available. When a read is finished, the cell state is changed to NONE and, if any writers are waiting, a condition-variable signal is sent.

The same applies when a cell is acquired for writing. The only difference is that the cell needs to be in state NONE to be used, and waiting readers are signaled if there are any.
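To summarize, each cell cycles through four states; the sketch below just restates the rules above in one picture:

/* Per-cell life cycle (waiting/signaling omitted):
 *
 *   NONE    --write_acquire-->  WRITING   (writer owns the cell, write_i advances)
 *   WRITING --write_finish--->  SET       (value becomes visible to readers)
 *   SET     --read_acquire--->  READING   (reader owns the cell, read_i advances)
 *   READING --read_finish---->  NONE      (cell is free again)
 *
 * An acquire that finds the cell in any other state blocks on the
 * corresponding condition variable. */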

For some reason the code sometimes deadlocks, so I had to add a "dirty" timeout to my condition variable. I would already be super happy if this could be solved, because the timeout basically turns the code into polling (which is relatively ugly) and at the same time causes many context switches. Maybe someone sees the bug? The "new" code also has the disadvantage that it is sometimes really slow, which is a killer for my application. I attached the "old" and the "new" code (the changed lines are marked).
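For reference, my understanding of the usual condition-variable pattern is that the predicate has to be re-checked in a loop while holding the same mutex that protects the state, so a signal can never fall between the check and the wait. A minimal self-contained sketch (the `ready` flag is a hypothetical stand-in for a cell state, not my actual code):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool ready = false;          /* stands in for states[i] == SET */

void waiter(void)
{
    pthread_mutex_lock(&m);
    while (!ready)                  /* loop: handles spurious wakeups */
        pthread_cond_wait(&cv, &m); /* atomically releases m while sleeping */
    ready = false;                  /* consume the state under the mutex */
    pthread_mutex_unlock(&m);
}

void signaler(void)
{
    pthread_mutex_lock(&m);         /* state change under the same mutex */
    ready = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}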

Thank you for helping me :)!

#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>
#include <assert.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>


typedef int ring_buffer_loc;

enum t_ring_buffer_cell_state
{
    NONE        = 0,
    WRITING     = 1,
    READING     = 2,
    SET         = 3
};

typedef struct {
    char *buffer;                   // data
    atomic_int_fast8_t *states;     // state per cell
    pthread_spinlock_t *locks;      // lock per cell
    int capacity;
    int element_size;

    pthread_spinlock_t read_i_lock;
    int read_i;
    pthread_spinlock_t write_i_lock;
    int write_i;

    pthread_spinlock_t waiting_readers_lock;
    int waiting_readers;
    pthread_spinlock_t waiting_writers_lock;
    int waiting_writers;

    pthread_mutex_t value_written_lock;
    pthread_mutex_t value_read_lock;
    pthread_cond_t value_written;
    pthread_cond_t value_read;
} ring_buffer;

ring_buffer *ring_buffer_create(int capacity, int element_size)
{
    ring_buffer *res = calloc(1, sizeof(ring_buffer));
    res->buffer = calloc(capacity, element_size);
    res->states = calloc(capacity, sizeof(*res->states));
    res->locks = malloc(capacity * sizeof(*res->locks));
    for (int i = 0; i < capacity; ++i) {
        pthread_spin_init(&res->locks[i], PTHREAD_PROCESS_PRIVATE);
    }
    pthread_spin_init(&res->write_i_lock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_init(&res->read_i_lock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_init(&res->waiting_readers_lock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_init(&res->waiting_writers_lock, PTHREAD_PROCESS_PRIVATE);
    res->capacity = capacity;
    res->element_size = element_size;
    return res;
}

void ring_buffer_destroy(ring_buffer *buffer)
{
    free(buffer->buffer);
    free(buffer->states);
    free(buffer->locks);
    free(buffer);
}

static inline void ring_buffer_inc_index(ring_buffer *buffer, int *index)
{
    *index = (*index + 1) % buffer->capacity;
}

void timespec_now_plus_ms(struct timespec *result, long ms_to_add)
{
    const long one_second_ns = 1000 * 1000 * 1000;
    timespec_get(result, TIME_UTC);
    const long nsec = result->tv_nsec + ms_to_add * 1000 * 1000;
    result->tv_sec  += nsec / one_second_ns;
    result->tv_nsec  = nsec % one_second_ns;
}

const void *ring_buffer_read_acquire(ring_buffer *buffer, ring_buffer_loc *loc)
{
    bool is_waiting = false;
start:
    pthread_spin_lock(&buffer->read_i_lock);
    const int read_i = buffer->read_i;
    pthread_spinlock_t *cell_lock = &buffer->locks[read_i];
    pthread_spin_lock(cell_lock);

    const int state = buffer->states[read_i];
    if (state == NONE || state == WRITING || state == READING) {
        if (!is_waiting) {
            is_waiting = true;
            pthread_spin_lock(&buffer->waiting_readers_lock);
            ++buffer->waiting_readers;
            pthread_mutex_lock(&buffer->value_written_lock);
            pthread_spin_unlock(&buffer->waiting_readers_lock);
        } else {
            pthread_mutex_lock(&buffer->value_written_lock);
        }

        pthread_spin_unlock(cell_lock);
        pthread_spin_unlock(&buffer->read_i_lock);

        // "new" code:
        // struct timespec ts;
        // do {
        //     timespec_now_plus_ms(&ts, 50);
        // } while (pthread_cond_timedwait(&buffer->value_written, &buffer->value_written_lock, &ts) == ETIMEDOUT && buffer->states[read_i] == state);
        // pthread_mutex_unlock(&buffer->value_written_lock);
        // "old" code (which hangs quite often):
        pthread_cond_wait(&buffer->value_written, &buffer->value_written_lock);
        pthread_mutex_unlock(&buffer->value_written_lock);

        goto start;
    } else if (state == SET) {
        if (is_waiting) {
            pthread_spin_lock(&buffer->waiting_readers_lock);
            --buffer->waiting_readers;
            assert(buffer->waiting_readers >= 0);
            pthread_spin_unlock(&buffer->waiting_readers_lock);
        }

        buffer->states[read_i] = READING;
        ring_buffer_inc_index(buffer, &buffer->read_i);
        pthread_spin_unlock(&buffer->read_i_lock);
        pthread_spin_unlock(cell_lock);
        *loc = read_i;
        return &buffer->buffer[read_i * buffer->element_size];
    } else {
        printf("unknown state!\n");
        exit(1);
    }
}

void ring_buffer_read_finish(ring_buffer *buffer, ring_buffer_loc loc)
{
    pthread_spinlock_t *cell_lock = &buffer->locks[loc];
    pthread_spin_lock(cell_lock);
    buffer->states[loc] = NONE;
    pthread_spin_unlock(cell_lock);

    pthread_spin_lock(&buffer->waiting_writers_lock);
    if (buffer->waiting_writers > 0) {
        pthread_cond_signal(&buffer->value_read);
    }
    pthread_spin_unlock(&buffer->waiting_writers_lock);
}

void *ring_buffer_write_acquire(ring_buffer *buffer, ring_buffer_loc *loc)
{
    bool is_waiting = false;
start:
    pthread_spin_lock(&buffer->write_i_lock);
    const int write_i = buffer->write_i;
    pthread_spinlock_t *cell_lock = &buffer->locks[write_i];
    pthread_spin_lock(cell_lock);

    const int state = buffer->states[write_i];
    if (state == SET || state == READING || state == WRITING) {
        if (!is_waiting) {
            is_waiting = true;
            pthread_spin_lock(&buffer->waiting_writers_lock);
            ++buffer->waiting_writers;
            pthread_mutex_lock(&buffer->value_read_lock);
            pthread_spin_unlock(&buffer->waiting_writers_lock);
        } else {
            pthread_mutex_lock(&buffer->value_read_lock);
        }

        pthread_spin_unlock(cell_lock);
        pthread_spin_unlock(&buffer->write_i_lock);

        // "new" code:
        // struct timespec ts;
        // do {
        //     timespec_now_plus_ms(&ts, 5);
        // } while (pthread_cond_timedwait(&buffer->value_read, &buffer->value_read_lock, &ts) == ETIMEDOUT && buffer->states[write_i] == state);
        // pthread_mutex_unlock(&buffer->value_read_lock);
        // "old" code (which hangs quite often):
        pthread_cond_wait(&buffer->value_read, &buffer->value_read_lock);
        pthread_mutex_unlock(&buffer->value_read_lock);

        goto start;
    } else if (state == NONE) {
        if (is_waiting) {
            pthread_spin_lock(&buffer->waiting_writers_lock);
            --buffer->waiting_writers;
            assert(buffer->waiting_writers >= 0);
            pthread_spin_unlock(&buffer->waiting_writers_lock);
        }

        buffer->states[write_i] = WRITING;
        ring_buffer_inc_index(buffer, &buffer->write_i);
        pthread_spin_unlock(&buffer->write_i_lock);
        pthread_spin_unlock(cell_lock);
        *loc = write_i;
        return &buffer->buffer[write_i * buffer->element_size];
    } else {
        printf("unknown state!\n");
        exit(1);
    }
}

void ring_buffer_write_finish(ring_buffer *buffer, ring_buffer_loc loc)
{
    pthread_spinlock_t *cell_lock = &buffer->locks[loc];
    pthread_spin_lock(cell_lock);
    buffer->states[loc] = SET;
    pthread_spin_unlock(cell_lock);

    pthread_spin_lock(&buffer->waiting_readers_lock);
    if (buffer->waiting_readers > 0) {
        pthread_cond_signal(&buffer->value_written);
    }
    pthread_spin_unlock(&buffer->waiting_readers_lock);
}

/* just for debugging */
void ring_buffer_dump(const ring_buffer *buffer)
{
    printf("RingBuffer\n");
    printf(" Capacity:     %d\n", buffer->capacity);
    printf(" Element size: %d\n", buffer->element_size);
    printf(" Read index:   %d\n", buffer->read_i);
    printf(" Write index:  %d\n", buffer->write_i);
    printf(" Cells:\n");
    for (int i = 0; i < buffer->capacity; ++i) {
        printf(" [%d]: STATE = ", i);
        switch (buffer->states[i]) {
            case NONE:
                printf("NONE");
                break;
            case WRITING:
                printf("WRITING");
                break;
            case READING:
                printf("READING");
                break;
            case SET:
                printf("SET");
                break;
        }
        printf("\n");
    }
    printf("\n");
}



/*
 * Test run
 */

struct write_read_n_conf {
    ring_buffer *buffer;
    int n;
};
 
static void *producer_thread(void *arg)
{
    struct write_read_n_conf conf = *(struct write_read_n_conf *)arg;
    for (int i = 0; i < conf.n; ++i) {
        ring_buffer_loc loc;
        int *value = ring_buffer_write_acquire(conf.buffer, &loc);
        *value = i;
        ring_buffer_write_finish(conf.buffer, loc);

        if (i % 1000 == 0) {
            printf("%d / %d\n", i, conf.n);
        }
    }
    return NULL;
}

static void *consumer_thread(void *arg)
{
    struct write_read_n_conf conf = *(struct write_read_n_conf *)arg;
    int tmp;
    bool ok = true;
    for (int i = 0; i < conf.n; ++i) {
        ring_buffer_loc loc;
        const int *value = ring_buffer_read_acquire(conf.buffer, &loc);
        tmp = *value;
        ring_buffer_read_finish(conf.buffer, loc);

        ok = ok && (tmp == i);
    }
    printf("ok = %d\n", ok);
    return (void *)ok;
}

void write_read_n_parallel(int n)
{
    ring_buffer *buffer = ring_buffer_create(50, sizeof(int));
    struct write_read_n_conf conf = {
        .buffer = buffer,
        .n      = n
    };

    pthread_t consumer;
    pthread_t producer;

    pthread_create(&consumer, NULL, consumer_thread, &conf);
    pthread_create(&producer, NULL, producer_thread, &conf);

    pthread_join(producer, NULL);

    void *res;
    pthread_join(consumer, &res); // hacky way to pass a bool: res == NULL means false, and otherwise true
    assert(res != NULL);
}   

int main() {
    write_read_n_parallel(10000000);
}
  • Queues are normally implemented with condition variables (pthread_cond_t) or semaphores. – n. m. could be an AI Sep 06 '20 at 12:16
  • It is a really bad practice to not initialize your condvars & mutexes. – mevets Sep 06 '20 at 13:53
  • 1
    condvars have no memory, so inbetween line 101 `pthread_spin_unlock(&buffer->waiting_readers_lock)` and line 116 `pthread_cond_wait(&buffer->value_written, &buffer->value_written_lock)` in one thread, another thread could execute line 218 `pthread_cond_signal(&buffer->value_written)` and this signal would be missed, thus the cond_wait would wait forever. – mevets Sep 06 '20 at 14:02
  • If your mutex based version is too slow under heavy load, then I wouldn't expect a spin-lock based version to perform better. If you want to achieve high throughput with a lot of parallel threads you should consider a lock-free or at least a lock-less version. However, a bounded lock-free ringbuffer based queue is not trivial, and also lock-free as well as lock-less versions do not provide blocking push/pull operations - these could be simulated by continuously retrying though (ideally with a backoff). – mpoeter Sep 07 '20 at 15:42
  • 1
    For a lock-less version you could take a look at the one by Dmitry Vyukov: http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue A lock-free version has been proposed by Nikolaev in this paper: https://arxiv.org/abs/1908.04511 – mpoeter Sep 07 '20 at 15:44
  • Finally, I've translated the queue described by Dmitry Vyukov to C and it works great (a rough sketch of the core is below). :) – Kevin Meier Sep 19 '20 at 08:42
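Update: for anyone finding this later, the core of the queue from the 1024cores link looks roughly like this in C11. This is a condensed sketch under simplifying assumptions, not my exact code: power-of-two capacity, plain int payload, no cache-line padding, and the try-operations are non-blocking, so blocking behavior still has to be layered on top (e.g. by retrying with a backoff).

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    atomic_size_t seq;               /* sequence number driving the cell's life cycle */
    int value;                       /* payload (int for brevity) */
} mpmc_cell;

typedef struct {
    mpmc_cell *cells;
    size_t mask;                     /* capacity - 1; capacity must be a power of two */
    atomic_size_t enqueue_pos;
    atomic_size_t dequeue_pos;
} mpmc_queue;

mpmc_queue *mpmc_create(size_t capacity)
{
    mpmc_queue *q = malloc(sizeof(*q));
    q->cells = malloc(capacity * sizeof(*q->cells));
    q->mask = capacity - 1;
    for (size_t i = 0; i < capacity; ++i)
        atomic_init(&q->cells[i].seq, i);
    atomic_init(&q->enqueue_pos, 0);
    atomic_init(&q->dequeue_pos, 0);
    return q;
}

bool mpmc_try_push(mpmc_queue *q, int value)
{
    size_t pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
    for (;;) {
        mpmc_cell *cell = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&cell->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {             /* cell is free: try to claim this slot */
            if (atomic_compare_exchange_weak_explicit(&q->enqueue_pos, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                cell->value = value;
                atomic_store_explicit(&cell->seq, pos + 1, memory_order_release);
                return true;
            }                        /* CAS failed: pos was reloaded, retry */
        } else if (diff < 0) {
            return false;            /* queue is full */
        } else {
            pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
        }
    }
}

bool mpmc_try_pop(mpmc_queue *q, int *value)
{
    size_t pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
    for (;;) {
        mpmc_cell *cell = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&cell->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)(pos + 1);
        if (diff == 0) {             /* cell holds a value: try to claim it */
            if (atomic_compare_exchange_weak_explicit(&q->dequeue_pos, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                *value = cell->value;
                atomic_store_explicit(&cell->seq, pos + q->mask + 1, memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;            /* queue is empty */
        } else {
            pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
        }
    }
}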
