Find first unset bit in buffer (optimization)

Question

What's the fastest/cleanest way to find the bit offset of the first unset bit in an array of arbitrary length?

Assume the prototype for your function is something like this size_t first_unset_bit(char unsigned const *buf, size_t bit_count, size_t start_bit); and that it may be called several times in quick succession on the same buffer. If you can give a better prototype, please justify.

If you use assembly any assembly, please provide a x86 sample that will run on a core2 or later. I will award the answer to the solution that provides the best combination of speed and beauty.

Update0

Here's my naive implementation. I have no idea if it's actually right yet, it's not yet used in a live system.

static size_t first_unset_bit(char unsigned const *buf, size_t bit_count, size_t start_bit)
{
    for (; start_bit < bit_count; ++start_bit)
    {
        size_t buf_index = start_bit / CHAR_BIT;
        int bit_index = start_bit % CHAR_BIT;
        if (!((buf[buf_index] >> bit_index) & 1))
            return start_bit;
    }
    return -1;
}

Is the array likely to contain lots of unset bits or lots of set bits? ie. will the most common case be all bits set, no bits set, most bits set, most bits unset, etc. — Lasse V. Karlsen, Jul 30 '10 at 16:25
Haha, no it's not homework. The common case will tend towards mostly unset: large contiguous blocks are set, with holes of varying sizes here and there (mostly smaller). The buffer is a bitmap of block allocations in a filesystem. — Matt Joiner, Jul 30 '10 at 16:29
Also, what endian-ness (if that matters here) is the target system? ie. if the first 8 bytes are 128,255,255,255,255,255,255,255, what is the "bit offset" of this? 0 or 7 or something entirely different? — Lasse V. Karlsen, Jul 30 '10 at 16:33
Since they are mostly unset, then any optimization that assumes mostly set would be useless, as it seems you won't loop far before you find an unset bit. Am I missing something? — James Black, Jul 30 '10 at 16:34
@James Black, doubtful, but that was the whole point of my question. I don't work in C or assembly these days, but I'd imagine that the most common case would dictate what kinds of optimizations makes sense and which doesn't. "Mostly unset" would probably rule out an initial loop looking at 32 or 64-bit values (based on the CPU), because then skipping over 4 or 8 bytes of "all set" bits would be much faster than looking at each individual byte. I just figured the answer to such a question would be important. — Lasse V. Karlsen, Jul 30 '10 at 16:42
Hm, perhaps saying they're mostly unset is misleading. The entire bitmap is mostly unset, but probably not the buffers passed to the function. Generally these will be carefully looking to fill spaces that have subsequently been unset. I'm not sure how best then to make your assumptions about set/unset. — Matt Joiner, Jul 30 '10 at 16:43
@Lasse V. Karlsen - I was going to suggest skipping over 4-8 bytes at a time, until I realized that it wouldn't gain anything, hence my comment on useless optimizations. — James Black, Jul 30 '10 at 16:45
Lasse: Good question, since there seems to be no agreed "bit direction" within a byte (an answer I got slammed for here: http://stackoverflow.com/questions/3344839/fastest-way-to-read-left-most-bit-from-unsigned-int-c/3344900#3344900) you can see from my naive example that 128,255,... would return 7. — Matt Joiner, Jul 30 '10 at 16:51
This just begs the question: What do you *want*? You should have a spec that says "with this particular example, the result is 17", etc. (and also, I don't know if I'm using the phrase "begs the question" correctly...) — Lasse V. Karlsen, Jul 30 '10 at 17:02
@Matt Joiner: Bitwise not operator, and "first one bit" becomes "first zero bit". — Merlyn Morgan-Graham, Jul 31 '10 at 12:48
@Matt Joiner: http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious If you want this to no longer be position, but the actual bit, in place, then just bit-shift 1. — Merlyn Morgan-Graham, Aug 01 '10 at 00:28

score 2 · Answer 1 · answered Jul 30 '10 at 16:29

2

In x-86 assembly language,

REPE SCAS 0xFFFFFFFF

...is likely to form an important part of the answer!

You have no choice but to examine every bit preceding the first unset bit, so it's just down to how quickly you can do that. Comparing 32 bits at a time is a good start, and once you get down to knowing which WORD contains the first unset bit you can use a combination of shifts/lookuptables to locate the first unset bit in that word.

answered Jul 30 '10 at 16:29

Roddy

66,617
42
165
277

This looks very promising. What's the chance that GCC will generate this anyway in higher levels of optimization? I'll try it out. – Matt Joiner Jul 30 '10 at 16:46
This is your best chance of getting `repe scasd` with plain C: `wchar_t c[2] = { -1, 0 }; size_t cnt = wcsspn(buf, c) * sizeof(wchar_t);` – R.. GitHub STOP HELPING ICE Jul 30 '10 at 17:30
3

Note that the Linux folks discarded a routine using repe scas in favor of plain C: LKML thread: http://lkml.indiana.edu/hypermail/linux/kernel/0803.1/0344.html – user382751 Jul 30 '10 at 18:20
@R..: Clever, use of 32bit wchar_t in a library function, very good observation. – Matt Joiner Jul 30 '10 at 20:57
Yeah, sadly it depends on `wcsspn` being optimized for the case of single-character second argument; I'm not sure how common this is. I only thought of it because my own implementation has this optimization (but not the asm; it's plain C). – R.. GitHub STOP HELPING ICE Jul 30 '10 at 23:32
REPE SCAS isn't necessarily the best choice on all processors. – peterchen Aug 05 '10 at 15:44

score 2 · Accepted Answer · answered Jul 30 '10 at 17:44

2

Linux has what I imagine to be a highly tuned implementation under the name "find_first_zero_bit".

answered Jul 30 '10 at 17:44

user382751

1,213
8
10

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.34.y.git;a=blob;f=lib/find_next_bit.c;h=24c59ded47a05d5617e2fe2501077f990de1ec4b;hb=master and http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.34.y.git;a=blob;f=arch/x86/include/asm/bitops.h;h=02b47a603fc88ee76f316989bc0475cb89be5e24;hb=master for ffz – user382751 Jul 30 '10 at 17:58
Excellent, thanks. This sets a definite benchmark. To add to that, it's extremely clever and involves some nice architecture specific assembly speedups. – Matt Joiner Jul 30 '10 at 18:03
Links are dead. – merlin2011 Apr 17 '17 at 09:06
**New Links:** https://elixir.bootlin.com/linux/latest/source/include/linux/find.h#L163 in c and https://elixir.bootlin.com/linux/latest/source/arch/m68k/include/asm/bitops.h#L332 in asm. – Fil Nov 21 '22 at 12:24

score 2 · Answer 3 · answered Jul 30 '10 at 18:01

2

Optimization hint: create lookup table that maps byte value to first unset bit than loop bytes but not bits.

answered Jul 30 '10 at 18:01

olegz

1,189
10
20

score 2 · Answer 4 · answered Aug 04 '10 at 02:31

Often overlooked, strings.h (yes, that standard header) contains a bunch of functions: ffs, ffsl and the likes, see here for more information. At least with gcc and on x86, this compiles to the corresponding one-cycle instruction, for instance BSFL.

I would thus suggest:

add a sentinel 0xFFFF at the end of your array
divide bit_count by 4 (so your iterating over 4-byte blocks instead of bytes)
use a while loop to find the block with the first set bit

For example:

cursor = start_pos;
while(position = ffsl(buf))
  cursor++;
return (cursor - startpos) * 32 + pos;

(Except you have to test whether youve reached the sentinel, in which case the buffer is blank.)

Although you should take this with a grain of salt because I don't purport to be an assembly expert... you'd basically use little more than 3 cycles for every 32 bits (one incrementation, one comparison, one BSFL instruction), and imagine you can do better using the long version of the function.

Your code seems a bit off, but the presence of this in the C library is great to know. I'll post an update with this in use. — Matt Joiner, Aug 04 '10 at 06:56
I add a sentinel at the end of the array so I only have to make one test in the while loop: I only have to test whether "position = ffsl(buf)" is different from zero (which is what ffsl will return if there are not bits set in the buffer). Without the sentinel, I would have, in addition, to make sure I'm not going out of bounds. There is indeed a flagrant error in the code: I am not increasing the address of the buffer. Actually you should get rid of "cursor" and replace all instances of "cursor" by "buf". (Or at least that's the idea.) — Jérémie, Aug 04 '10 at 13:08

zwol · Answer 5 · 2010-07-30T22:24:16.983

Without using any assembly language, but with GCC builtins, and assuming that bit_count is a multiple of the number of bits in a long, something like this should work. I changed your function to take a void* buffer argument to avoid aliasing problems. Totally untested, I might have messed up the math especially in the leading "if (start_bit % LONG_BIT) block.

#include <stddef.h>
#include <limits.h>
#define LONG_BIT (CHAR_BIT * sizeof(unsigned long))

size_t
first_unset_bit(const void *buf, size_t bit_count, size_t start_bit)
{
    size_t long_count = bit_count / LONG_BIT;
    size_t start_long = start_bit / LONG_BIT;

    const unsigned long *lbuf = (const unsigned long *)buf;

    if (start_bit % LONG_BIT)
    {
        size_t offset = start_bit % LONG_BIT;
        unsigned long firstword = lbuf[start_long];
        firstword = ~(firstword | ~((1UL << offset) - 1));
        if (firstword)
            return start_bit - offset + __builtin_clzl(firstword);

        start_long += 1;
    }

    for (size_t i = start_long; i < long_count; i++)
    {
        unsigned long word = lbuf[i];
        if (~word)
            return i*LONG_BIT + __builtin_clzl(~word);
    }
    return bit_count + 1; // not found
}

I may have misremembered the rules. I have had compilers object to converting a `char*` to a `long*` and then dereferencing the converted pointer, but just now gcc doesn't seem to mind. — zwol, Jul 30 '10 at 22:25

score 0 · Answer 6 · answered Jul 30 '10 at 16:30

The obvious solution is to just loop through from start_bit until you either get to the end of the array or find an unset bit.

Because it can be of arbitrary length you can't just turn it into a number and find the value that way, as it may likely be largr than the size of a double.

score 0 · Answer 7 · answered Jul 30 '10 at 17:27

I'm assuming your buffer is aligned, e.g. a buffer returned by malloc. If not you'll need to scan the unaligned part at the beginning first.

uint32_t *p = (void *)buf;
while (!(*p+1)) p++;
size_t cnt = (unsigned char *)p - buf << CHAR_BIT;
if (*p>=0xFFFF0000)
  if (*p>=0xFFFFFF00)
    if (*p>=0xFFFFFFF0)
      if (*p>=0xFFFFFFFC)
        if (*p>=0xFFFFFFFE) cnt+=31;
        else cnt+=30;
      else
        if (*p>=0xFFFFFFF9) cnt+=29;
        else cnt+=28;
    else
      if (*p>=0xFFFFFFC0)
        if (*p>=0xFFFFFFE0) cnt+=27;
        else cnt+=26;
      else
        if (*p>=0xFFFFFF90) cnt+=25;
        else cnt+=24;
  else
    ...

I'll leave filling in the rest of the binary search to you.

score 0 · Answer 8 · answered Jul 31 '10 at 11:42

As others have mentionned, assembly language may give the best performance. If that is not an option, you may wish to consider the following (untested) routine. It is not quite that for which you asked, but it should be close enough that you can adapt it to your needs.

size_t findFirstNonFFbyte (
    unsigned char const *buf,       /* ptr to buffer in which to search */
    size_t               bufSize,   /* length of the buffer */
    size_t               startHint  /* hint for the starting byte (<= bufSize) */
    ) {
    unsigned char * pBuf = buf + startHint;
    size_t          bytesLeft;

    for (bytesLeft = bufSize - startHint;
         bytesLeft > 0;
         bytesLeft = startHint, pBuf = buf) {
        while ((bytesLeft > 0) && (*pBuf == 0xff)) {
            *pBuf++;
            bytesLeft--;
        }

        if (bytesLeft > 0) {
            return ((int) (pBuf - buf));
        }
    }
    return (-1);
}

To use ...

index = findFirstNonFFbyte (...);
bit_index = index + bitTable[buffer[index]];

Additional Notes:

The above code will check 8 bits at a time. If you know your buffer will be 4-byte aligned and its length an even multiple of 4 bytes, then you can test 32-bits at a time with a little tweaking (don't forget the return value calculation).

If your starting bit is not a hint but an absolute, then you can skip the for-loop.

You will need to provide your own bit lookup table. It should be an array 256 bytes long. Each entry identifies the first clear bit of the byte that indexes that entry. Personal experience tells me that different people will number those bits differently. Some call bit 0 the most signficant bit of the byte; others will call bit 0 the least significant bit of the byte. Whatever style you pick, be sure to be consistent.

Hope this helps.

score 0 · Answer 9 · answered Jul 31 '10 at 12:13

use the gcc builtin equivalent of Microsoft's _BitScanReverse, I use something like this to find the first free bit(representing block usage) for my memory system:

        __forceinline DWORD __fastcall GetNextFreeBlockIndex(PoolBlock* pPoolBlock)
        {
            DWORD dwIndex;
            DWORD dwOffset = 0;
            DWORD* pUsage = &pPoolBlock->fUsage[0];
            while(dwOffset < MMANAGER_BLOCKS_PER_POOL)
            {
                DWORD dwUsage = *pUsage;
                if(dwUsage != 0xFFFFFFFF && _BitScanForward(&dwIndex,~dwUsage))
                {
                    #if !( MMANAGER_ATOMIC_OPS )
                        pPoolBlock->pSync.Enter();
                    #endif

                    ATOMIC_Write(DWORD,pPoolBlock->dwFreeIndex,dwOffset);
                    ATOMIC_Write(DWORD*,pPoolBlock->pFreeUsage,pUsage);

                    #if !( MMANAGER_ATOMIC_OPS )
                        pPoolBlock->pSync.Leave();
                    #endif

                    return dwIndex + dwOffset;
                }

                pUsage++;
                dwOffset += 32;
            }

            return 0xFFFFFFFF;
        }

        __forceinline DWORD __fastcall GetFreeBlockIndex(PoolBlock* pPoolBlock)
        {
            DWORD dwIndex;
            DWORD dwUsage = *pPoolBlock->pFreeUsage;
            if(dwUsage == 0xFFFFFFFF)
                return GetNextFreeBlockIndex(pPoolBlock);

            if(_BitScanForward(&dwIndex,~dwUsage))
                return dwIndex + pPoolBlock->dwFreeIndex;

            return 0xFFFFFFFF;
        }

excuse the tabbing, this is straight outta some #if/#endif VS code. ofc this code is made just for DWORDS 's, you can just do block_size & 3 to find if there are any odd bytes, copy those odd bytes to a DWORD and scan the DWORD, then cut of any results greater than (block_size & 3) << 3

what's the point of `__fastcall` if you're forcing the code to inline? :P also it's well known that fastcall isn't actually fast — Matt Joiner, Aug 04 '10 at 02:40
Its merely for debugging purposes(aka disabling of inlines in debug mode), I just prefer running through __fastcall funcs in ollydbg as apposed to __stdcall(don't ask, just some weird habit I picked up when I write inlines in MSVC...) — Necrolis, Aug 04 '10 at 07:12

Find first unset bit in buffer (optimization)

Update0

9 Answers9