improving C circular buffer efficiency

Question

I'd like some help improving the efficiency of my circular buffer code.

I had a look around stackoverflow and found that (nearly) all of the topics on circular buffers are about the uses of such a buffer or the basic implementation of a circular buffer. I really need information about how to make it super efficient.

The plan is to use this buffer with the STM32F4 microcontroller which has a single precicion FPU. I plan to make heavy use of especially the write() and readn() functions. We're literally talking a few million calls a second here so shaving of a few clock cycles here and there is really going to make a difference.

I'll put the most important bits of code here, the full buffer code is available via http://dl.dropbox.com/u/39710897/circular%20buffer.rar

Can anyone provide me with a few pointers on how to improve the efficiency of this buffer?

#define BUFF_SIZE 3             // buffer size set at compile time

typedef struct buffer{
    float buff[BUFF_SIZE];
    int readIndex;
    int writeIndex;
}buffer;

/********************************\
* void write(buffer* buffer, float value)
* writes value into the buffer
* @param buffer* buffer
*   pointer to buffer to be used
* @param float value
*   valueto be written in buffer
\********************************/
void write(buffer* buffer,float value){
    buffer->buff[buffer->writeIndex]=value;
    buffer->writeIndex++;
    if(buffer->writeIndex==BUFF_SIZE)
        buffer->writeIndex=0;
}

/********************************\
* float readn(buffer* buffer, int Xn)
* reads specified value from buffer
* @param buffer* buffer
*   pointer to buffer to be read from
* @param int Xn
*   specifies the value to be read from buffer counting backwards from the most recently written value
*   i.e. the most recently writen value can be read with readn(buffer, 0), the value written before that with readn(buffer, 1)
\********************************/
float readn(buffer* buffer, int Xn){
    int tempIndex;

    tempIndex=buffer->writeIndex-(Xn+1);
    while(tempIndex<0){
        tempIndex+=BUFF_SIZE;
    }

    return buffer->buff[tempIndex];
}

readindex is indeed not used in the functions listed above. It is used though in the read() function that can be found in the attached rar file. The readn() function listed here serves to read a specific value from the buffer (ie the second last written value) — Gurba, Mar 15 '12 at 11:00

score 16 · Accepted Answer · edited Mar 15 '19 at 18:45

16

As "Oli Charlesworth" suggested - you'd be able to simplify things if your buffer size is a power of 2. I'd like to write the read/write function bodies, so that the intent is more clear.

#define BUFF_SIZE (4U)
#define BUFF_SIZE_MASK (BUFF_SIZE-1U)

struct buffer {
    float buff[BUFF_SIZE];
    unsigned writeIndex;
};

void write(struct buffer *buffer, float value) {
    buffer->buff[(++buffer->writeIndex) & BUFF_SIZE_MASK] = value;
}

float readn(struct buffer *buffer, unsigned Xn){
    return buffer->buff[(buffer->writeIndex - Xn) & BUFF_SIZE_MASK];
}

Some explanations. Note that there's no branching (if) at all. We don't limit the array index to the array bounds, instead we're AND-ing it with the mask.

edited Mar 15 '19 at 18:45

Evan Benn

1,571
2
14
20

answered Mar 15 '12 at 11:05

valdo

12,632
2
37
67

3

As soon as you use bitmasks you should simply switch to `unsigned` index types. Then you don't have to argue about 2-complement or stuff like that. – Jens Gustedt Mar 15 '12 at 12:14
2

I thank you a LOT, this has made my code roughly 6.5 times faster! I'd call that a rather dramatic improvement – Gurba Mar 15 '12 at 12:34
since I can set only 1 answer as accepted, I'll to that with this one because of the examples even though Oli Charlesworth came up with a similar answer. – Gurba Mar 15 '12 at 12:37
It might make sense to try to make ReadArray and WriteArray functions that should work with arrays instead of single values. Or to make write and readn as macroses or inline procedures (in case of C++ compiler). In that case you avoid some stack workload, that may give you cpu ticks. – Mtr Mar 29 '13 at 07:32
2

@valdo writeIndex will eventually overflow – 4pie0 Feb 07 '16 at 11:30
As @4pie0 noted, this answer is incorrect C code, and will invoke undefined behaviour. I tried to edit the answer: https://stackoverflow.com/review/suggested-edits/22381560 but was rejected. Please do not use this code. – Evan Benn Mar 08 '19 at 01:13
@EvanBenn: why do you think it invokes undefined behavior? There is no problem with `writeIndex` overflow. – valdo Mar 08 '19 at 06:30
When you say there is no problem, what do you mean? If signed `writeIndex` overflows that is undefined behaviour. unsigned integers wrap around. – Evan Benn Mar 10 '19 at 23:16

score 12 · Answer 2 · answered Mar 15 '12 at 10:46

12

If you can make your buffer size a power-of-2, then the check against zero can be replaced with unconditional bit-masking. On most processors, this should be faster.

answered Mar 15 '12 at 10:46

Oliver Charlesworth

267,707
33
569
680

1

Agree. Also there'll be no need to check for index over/under-flow at all. Just increment and decrement it without any checking, just AND-mask it for the actual element access – valdo Mar 15 '12 at 10:53
I'll be investigating this, making it a power of 2 is easy enough (just set BUFF_SIZE to 4). I'll have to dig a bit for the bit-masking though – Gurba Mar 15 '12 at 11:01
1

@Gurba I assume the final application will have a much larger buffer size? Otherwise I would strongly question the use of the ring buffer ADT in the first place. – Lundin Mar 15 '12 at 11:59

Pavan Manjunath · Answer 3 · 2012-03-15T11:51:35.417

2

This may not be seem elegant but is efficient. Accessing structure elements through the pointer is taking up a lot of instructions. Why not remove the structure altogether and make buffer and writeIndex as global variables? This will considerably decrease the size of your readn and write functions.

I tried in gcc and here is the output with and without the structure

With Structure

_write:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %ecx
    movl    8(%ebp), %eax
    movl    16(%eax), %edx
    movl    12(%ebp), %eax
    movl    %eax, (%ecx,%edx,4)
    movl    8(%ebp), %eax
    incl    16(%eax)
    movl    8(%ebp), %eax
    cmpl    $3, 16(%eax)
    jne L1
    movl    8(%ebp), %eax
    movl    $0, 16(%eax)
L1:
    popl    %ebp
    ret

Without Structure. ie Making buffer and writeIndex as global

_write:
    pushl   %ebp
    movl    %esp, %ebp
    movl    _writeIndex, %edx
    movl    8(%ebp), %eax
    movl    %eax, _buff(,%edx,4)
    incl    _writeIndex
    cmpl    $3, _writeIndex
    jne L1
    movl    $0, _writeIndex
L1:
    popl    %ebp
    ret

edited Mar 15 '12 at 11:51

answered Mar 15 '12 at 11:43

Pavan Manjunath

27,404
12
99
125

With optimizations enabled there shouldn't be such a big difference. One may resolve once the address of `buff` and `writeIndex`, all the rest should be the same. OTOH without optimizations any access such as `buff->...` would be long – valdo Mar 15 '12 at 11:51
2

This won't work if the OP needs more than one buffer instance. – Oliver Charlesworth Mar 15 '12 at 12:01
@valdo Hmmm.. But the compiler seems to be doing something else. I tried with `-O2` and the resultant code seems to be much bigger than the above ones :O – Pavan Manjunath Mar 15 '12 at 12:02
@OliCharlesworth You are right. He would then have to create more global buffers/indices which would go on becoming clumsier. That's why I added the `not so elegant` phrase in my answer :) – Pavan Manjunath Mar 15 '12 at 12:06
I'll be keeping this one in mind, but I doubt that I'll use it. I do indeed need multiple instances and this just seems unmaintanable to me.. I'll just get confused too often – Gurba Mar 15 '12 at 12:12
The assembler really looks ineffient. I am sure you don't use the correct options for a modern platform. (-O3 -march=native should do for gcc). Also, the lengthy preamble is in that assembler because this is seen as a function. when declared as `inline` and then inlined into the call side, most of this noise should go away. – Jens Gustedt Mar 15 '12 at 12:20
@Stacker, I meant "assembler" as a short hand for "assembly listing". – Jens Gustedt Mar 15 '12 at 19:02

score 2 · Answer 4 · answered Mar 15 '12 at 11:54

Keeping track of the start and the end of the circular buffer with pointers, is likely a bit faster than array indexing, since the address will computed in runtime in case of the latter. Try to replace readIndex and writeIndex with float* instead. The code would then be

*buffer->writeIndex = value;
buffer->writeIndex++;
if(buffer->writeIndex == buffer + BUFF_SIZE)
  buffer->writeIndex=buffer->buff;

buffer + BUFF_SIZE will still be a constant expression and the compiler will translate that to a fixed address at compile time.

score 1 · Answer 5 · answered Mar 08 '19 at 01:30

The accepted answer contains code that is incorrect and will invoke undefined behavior. Correction below:

#define BUFF_SIZE (4U)
#define BUFF_SIZE_MASK (BUFF_SIZE-1U)

struct buffer {
    float buff[BUFF_SIZE];
    unsigned writeIndex;
};

void write(struct buffer *buffer, float value) {
    buffer->buff[(++buffer->writeIndex) & BUFF_SIZE_MASK] = value;
}

float readn(struct buffer *buffer, unsigned Xn){
    return buffer->buff[(buffer->writeIndex - Xn) & BUFF_SIZE_MASK];
}

The error in the original answer was to assume that 'int' will wrap around. Using binary masks with int is also unwise.

improving C circular buffer efficiency

5 Answers5