Efficient algorithm for finding a byte in a bit array

Question

Given a byte array uint8_t data[N], what is an efficient method to find a byte uint8_t search within it even if search is not octet-aligned? That is, the first three bits of search could be in data[i] and the next five bits in data[i+1].

My current method uses a function bool get_bit(const uint8_t* src, struct internal_state* state). struct internal_state holds a mask that is shifted right on each call, ANDed with the current byte of src, and the result is returned, while maintaining size_t src_index < size_t src_len. I left-shift each returned bit into a uint8_t my_register, compare the register with search after every bit, and use state->src_index and state->src_mask to recover the position of the matched byte.

Is there a better method for this?
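
For reference, the bit-at-a-time approach described above might look roughly like the following. This is only a sketch: find_byte_bitwise is a hypothetical name, the state struct is folded into local variables, and bits are assumed to be numbered MSB-first within each byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical baseline: feed the bits MSB-first into an 8-bit shift
 * register and compare after every bit. Returns the bit offset of the
 * first match, or -1 if there is none. */
long find_byte_bitwise(const uint8_t *src, size_t src_len, uint8_t search)
{
    uint8_t reg = 0;

    for (size_t bit = 0; bit < src_len * 8; bit++) {
        uint8_t mask = (uint8_t)(0x80u >> (bit % 8));
        unsigned b = (src[bit / 8] & mask) ? 1u : 0u;

        reg = (uint8_t)((reg << 1) | b);

        /* the register holds 8 real bits only from bit index 7 onwards */
        if (bit >= 7 && reg == search)
            return (long)(bit - 7);
    }
    return -1;
}
```

It does one comparison per bit, which is the same amount of comparing as the windowed versions in the answers; the difference is mainly in how much work is spent extracting each bit.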

This is hard to do in well-defined c. You can't assume there are 8 bits in a byte. I'd be tempted to use an assembly based solution. — Bathsheba, May 11 '15 at 18:52
Maybe you can find some inspiration [here](http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm#Shifting_substrings_search_and_competing_algorithms). It's not exactly the same, but conceptually similar. — mkrieger1, May 11 '15 at 18:55
Are overlapping bit patterns findable? I suggest converting `data` and `search` to strings (one byte per bit) and using `ptr = strstr(lastptr+1, search)` or `ptr = strstr(lastptr+8, search)` — Weather Vane, May 11 '15 at 19:22
If you're willing to forget about well-defined, portable C, you can probably speed things up by handling data in chunks of 32 or 64 bits, depending on the architecture of your machine. Then you'd have to deal with endian issues, though, especially on little-endian architectures such as x86. — Andrew Henle, May 11 '15 at 19:29
Are you willing to accept SSE intrinsics? (if so, up to which version?) — harold, May 11 '15 at 19:48

score 4 · Answer 1 · answered May 11 '15 at 19:50

If you're searching for an eight-bit pattern within a large array, you can implement a sliding window over 16-bit values to check whether the pattern is part of the two bytes forming that 16-bit value.

To stay portable you have to take care of endianness. My implementation does this by building the 16-bit value manually: the high byte is always the currently iterated byte and the low byte is the following byte. If you do a simple conversion like value = *((unsigned short *)pData), you will run into trouble on little-endian processors such as x86, where the two bytes end up in swapped order.

Once value, cmp and mask are set up, cmp and mask are shifted right one bit at a time and (value & mask) is compared with cmp at each of the eight offsets. If the pattern does not start within the high byte, the loop continues with the next byte as the start byte.

Here is my implementation including some debug printouts (the function returns the bit position, or -1 if the pattern was not found):

#include <stdio.h>   /* printf, NULL */

int findPattern(unsigned char *data, int size, unsigned char pattern)
{
    int result = -1;
    unsigned char *pData;
    unsigned char *pEnd;
    unsigned short value;
    unsigned short mask;
    unsigned short cmp;
    int tmpResult;



    if ((data != NULL) && (size > 0))
    {
        pData = data;
        pEnd = data + size;

        while ((pData < pEnd) && (result == -1))
        {
            if ((pData + 1) < pEnd)   /* still at least two bytes to check? */
            {
                /* debug print kept inside the check so pData[1] is never read out of bounds */
                printf("\n\npData = {%02x, %02x, ...};\n", pData[0], pData[1]);

                tmpResult = (int)(pData - data) * 8;   /* calculate bit offset according to current byte */

                /* avoid endianness troubles by "manually" building value! */
                value = *pData << 8;
                pData++;
                value += *pData;

                /* create a sliding window to check if the search pattern is within value */
                cmp = pattern << 8;
                mask = 0xFF00;
                while (mask > 0x00FF)   /* the low byte is checked within next iteration! */
                {
                    printf("cmp = %04x, mask = %04x, tmpResult = %d\n", cmp, mask, tmpResult);

                    if ((value & mask) == cmp)
                    {
                        result = tmpResult;
                        break;
                    }

                    tmpResult++;   /* count bits! */
                    mask >>= 1;
                    cmp >>= 1;
                }
            }
            else
            {
                /* only one chance left if there is only one byte left to check! */
                if (*pData == pattern)
                {
                    result = (int)(pData - data) * 8;
                }

                pData++;
            }
        }
    }

    return (result);
}

weaknespase · Answer 2 · 2015-05-11T20:14:08.840

I don't know if it would be better, but I would use a sliding window.

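#include <stddef.h>
#include <stdint.h>

/* Note: bits are numbered LSB-first within each byte here
 * (bit 0 of data[0] is bit position 0). */
long find_byte(const uint8_t *data, size_t length, uint8_t search)
{
    size_t counter = 0;      /* index of the last byte fed into the window */
    unsigned feeder = 8;     /* number of valid bits currently in the window */
    uint32_t window;

    if (length == 0)
        return -1;
    window = data[0];

    while (search ^ (window & 0xff)) {
        window >>= 1;
        feeder--;
        if (feeder < 8) {
            counter++;
            if (counter >= length) {
                feeder = 0;   /* ran out of data: signal "not found" */
                break;
            }
            window |= (uint32_t)data[counter] << feeder;
            feeder += 8;
        }
    }

    /* Returns index of first bit of first sequence occurrence or -1 if sequence is not found */
    return (feeder > 0) ? (long)((counter + 1) * 8 - feeder) : -1;
}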

With some alterations you can also use this method to search for a bit sequence of arbitrary length, from 1 up to 64 minus the array element size in bits (so up to 56 bits for 8-bit elements and a 64-bit window).
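
One possible shape of that generalisation is sketched below. It is not the answer's code: find_bits_lsb_first, filled, next and pos are hypothetical names, the window is assumed to be 64 bits wide, and bits are numbered LSB-first within each byte as in the loop above.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: search for the low 'nbits' bits of 'pattern' (1..56 bits),
 * LSB-first bit numbering. Returns the bit offset of the first match,
 * or -1 if there is none or nbits is out of range. */
long find_bits_lsb_first(const uint8_t *data, size_t length,
                         uint64_t pattern, unsigned nbits)
{
    if (nbits == 0 || nbits > 56)
        return -1;

    const uint64_t mask = (UINT64_C(1) << nbits) - 1;
    uint64_t window = 0;
    unsigned filled = 0;   /* number of valid bits in the window */
    size_t next = 0;       /* next byte to feed into the window */
    size_t pos = 0;        /* bit position of window bit 0 */

    for (;;) {
        /* top up the window while a whole byte still fits */
        while (filled <= 56 && next < length) {
            window |= (uint64_t)data[next++] << filled;
            filled += 8;
        }
        if (filled < nbits)
            return -1;     /* not enough bits left for a match */
        if ((window & mask) == (pattern & mask))
            return (long)pos;
        window >>= 1;
        filled--;
        pos++;
    }
}
```

The 56-bit limit comes from always needing room for one more whole byte in the 64-bit window before the pattern can be compared.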

John Bollinger · Answer 3 · 2015-05-11T20:00:34.663

I don't think you can do much better than this in C:

/*
 * Searches for the 8-bit pattern represented by 'needle' in the bit array
 * represented by 'haystack'.
 *
 * Returns the index *in bits* of the first appearance of 'needle', or
 * -1 if 'needle' is not found.
 */
int search(uint8_t needle, int num_bytes, uint8_t haystack[num_bytes]) {
    if (num_bytes > 0) {
        uint16_t window = haystack[0];

        if (window == needle) return 0;
        for (int i = 1; i < num_bytes; i += 1) {
            /* parenthesised: '+' binds tighter than '<<' */
            window = (uint16_t)((window << 8) | haystack[i]);

            /* Candidate for unrolling: */
            for (int j = 7; j >= 0; j -= 1) {
                /* parenthesised: '==' binds tighter than '&' */
                if (((window >> j) & 0xff) == needle) {
                    return 8 * i - j;
                }
            }
        }
    }
    return -1;
}

The main idea is to handle the 87.5% of cases that cross the boundary between consecutive bytes by pairing bytes in a wider data type (uint16_t in this case). You could adjust it to use an even wider data type, but I'm not sure that would gain anything.

What you cannot safely or easily do is anything involving casting part or all of your array to a wider integer type via a pointer (i.e. (uint16_t *)&haystack[i]). You cannot be assured of proper alignment for such a cast, nor of the byte order with which the result might be interpreted.
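
One way to pair bytes without a pointer cast is to assemble the wider value explicitly from individual bytes, which is what the window update above already does. The helpers below are a minimal sketch of that idea; load_be16 and load_be32 are hypothetical names, and the most significant byte is assumed to come first so that bit offsets match the function above.

```c
#include <stdint.h>

/* Build a 16-bit value from two consecutive bytes, first byte in the
 * high half. No alignment requirement and no dependence on host byte order. */
uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Same idea for four consecutive bytes. */
uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

Compilers commonly recognise this pattern and turn it into a single load (plus a byte swap where needed), so the portable form usually costs little.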

edited May 11 '15 at 20:00

answered May 11 '15 at 19:41

John Bollinger


If you use a wider data type - 64 bits, for example - you could issue a prefetch that loads `n[i+8]` through `n[i+15]` right as you start working on `n[i]` through `n[i+7]`. By the time you got through the first 7 bytes and started needing bits from the next set of data it would hopefully be in a register, ready for use, instead of stalling the CPU waiting for data to be loaded from memory. Dealing with endian issues would be tedious, but the OP did ask for an 'efficient algorithm', by which I take to mean 'fast'. – Andrew Henle May 11 '15 at 20:14
I wonder if it would be even faster if you replaced the inner loop with a table lookup? something like table[haystack[i-1]][haystack[i]] would replace some arithmetic with a memory access. My guess would be slower for small values of num_bytes, but faster once the table is in the data cache? – Peter de Rivaz May 11 '15 at 20:18
@AndrewHenle it'll auto-prefetch anyway since it's just a linear scan through memory, the TLB priming may help though – harold May 11 '15 at 20:22
@PeterdeRivaz, I'm not following. You could possibly replace the inner loop with *eight* table lookups, but you're still going to need some arithmetic (masking). You could also simply unroll the inner loop (per the source comment) if that turns out to be a win and the compiler doesn't do it for you. Any way around, you need eight comparisons for each byte in `haystack` after the zeroth. – John Bollinger May 11 '15 at 20:45

score 1 · Answer 4 · answered May 11 '15 at 22:00

If AVX2 is acceptable (with earlier versions it didn't work out so well, but you can still do something there), you can search in a lot of places at the same time. I couldn't test this on my machine (I could only compile it), so the following is meant to give you an idea of how it could be approached rather than to be copy-and-paste code, and I'll try to explain it rather than just dump code.

The main idea is to read a uint64_t, shift it right by all values that make sense (0 through 7), then for each of those 8 new uint64_t's, test whether the byte is in there. Small complication: for the uint64_t's shifted by more than 0, the highest position should not be counted since it has zeroes shifted into it that might not be in the actual data. Once this is done, the next uint64_t should be read at an offset of 7 from the current one, otherwise there is a boundary that is not checked across. That's fine though, unaligned loads aren't so bad anymore, especially if they're not wide.

So now for some (untested, and incomplete, see below) code,

__m256i needle = _mm256_set1_epi8(find);
size_t i;
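for (i = 0; i + 7 < n; i += 7) {   // each block uses bytes i..i+7, so byte i+7 must exist
    // unaligned load via memcpy (needs <string.h>); assumes a little-endian target such as x86
    uint64_t d;
    memcpy(&d, data + i, sizeof d);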
    __m256i x = _mm256_set1_epi64x(d);
    __m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
    __m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
    low = _mm256_cmpeq_epi8(low, needle);
    high = _mm256_cmpeq_epi8(high, needle);
    // in the qword right-shifted by 0, all positions are valid
    // otherwise, the top position corresponds to an incomplete byte
    uint32_t lowmask = 0x7f7f7fffu & _mm256_movemask_epi8(low);
    uint32_t highmask = 0x7f7f7f7fu & _mm256_movemask_epi8(high);
    uint64_t mask = lowmask | ((uint64_t)highmask << 32);
    if (mask) {
        // 0-based index of the lowest set bit (ffs-style builtins return a 1-based index)
        int bitindex = __builtin_ctzll(mask);
        // the bit-index and byte-index are swapped
        return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
    }
}

The funny "bit-index and byte-index are swapped" thing is because searching within a qword is done byte by byte and the results of those comparisons end up in 8 adjacent bits, while the search for "shifted by 1" ends up in the next 8 bits and so on. So in the resulting masks, the index of the byte that contains the 1 is a bit-offset, but the bit-index within that byte is actually the byte-offset, for example 0x8000 would correspond to finding the byte at the 7th byte of the qword that was right-shifted by 1, so the actual index is 8*7+1.
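
The index swap can be checked in isolation with a small scalar helper. This is a sketch only: match_bit_offset is a hypothetical name, mask is assumed to be non-zero, and a plain loop stands in for a count-trailing-zeros instruction so the snippet stays portable.

```c
#include <stddef.h>
#include <stdint.h>

/* Translate the lowest set bit of the combined 64-bit comparison mask
 * into a bit offset in the data. 'i' is the byte offset of the block.
 * Precondition: mask != 0. */
size_t match_bit_offset(size_t i, uint64_t mask)
{
    unsigned bitindex = 0;

    while ((mask & 1u) == 0) {   /* portable count-trailing-zeros */
        mask >>= 1;
        bitindex++;
    }

    size_t byte_offset = bitindex & 7u;   /* bit within the mask byte  = byte in the qword */
    size_t bit_shift   = bitindex >> 3;   /* byte within the mask      = shift amount     */

    return 8 * (i + byte_offset) + bit_shift;
}
```

For the 0x8000 example this gives 8*7+1 = 57. Note that the lowest set bit of the mask is not necessarily the earliest match in the data: mask bit 1 (byte 1, shift 0) maps to offset 8, while mask bit 8 (byte 0, shift 1) maps to offset 1.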

There is also the issue of the "tail", the part of the data left over when all blocks of 7 bytes have been processed. It can be done much the same way, but now more positions contain bogus bytes. Now n - i bytes are left over, so the mask has to have n - i bits set in the lowest byte, and one fewer for all other bytes (for the same reason as earlier, the other positions have zeroes shifted in). Also, if there is exactly 1 byte "left", it isn't really left because it would have been tested already, but that doesn't really matter. I'll assume the data is sufficiently padded that accessing out of bounds doesn't matter. Here it is, untested:

if (i < n - 1) {
    // make n-i-1 bits, then copy them to every byte
    uint32_t validh = ((1u << (n - i - 1)) - 1) * 0x01010101;
    // the lowest position has an extra valid bit, set lowest zero
    uint32_t validl = (validh + 1) | validh;
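    // unaligned load via memcpy (needs <string.h>); still reads 8 bytes, so the padding assumption applies
    uint64_t d;
    memcpy(&d, data + i, sizeof d);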
    __m256i x = _mm256_set1_epi64x(d);
    __m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
    __m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
    low = _mm256_cmpeq_epi8(low, needle);
    high = _mm256_cmpeq_epi8(high, needle);
    uint32_t lowmask = validl & _mm256_movemask_epi8(low);
    uint32_t highmask = validh & _mm256_movemask_epi8(high);
    uint64_t mask = lowmask | ((uint64_t)highmask << 32);
    if (mask) {
        // 0-based index of the lowest set bit (ffs-style builtins return a 1-based index)
        int bitindex = __builtin_ctzll(mask);
        return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
    }
}
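
The valid-position masks used for the tail can be sanity-checked on their own. The helper below is a sketch that repeats the two mask expressions from the tail code; tail_valid_masks and remaining (standing for n - i, assumed to be between 1 and 8) are hypothetical names.

```c
#include <stdint.h>

/* Compute the masks of valid comparison results when 'remaining' bytes
 * (1..8) of real data are left in the block.
 *   validh: remaining - 1 valid positions for every shifted copy
 *   validl: same, but the unshifted copy (lowest byte) has one extra position */
void tail_valid_masks(unsigned remaining, uint32_t *validl, uint32_t *validh)
{
    /* make remaining - 1 bits, then copy them to every byte */
    *validh = ((1u << (remaining - 1)) - 1) * 0x01010101u;
    /* adding 1 sets the lowest zero bit, which lies in the lowest byte */
    *validl = (*validh + 1) | *validh;
}
```

With remaining equal to 8 this reproduces the constants 0x7f7f7f7f and 0x7f7f7fff used in the main loop, which suggests the two pieces are consistent with each other.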

samgak · Answer 5 · 2015-05-12T05:43:36.643

If you are searching a large amount of memory and can afford an expensive setup, another approach is to use a 64K lookup table. For each possible 16-bit value, the table stores a byte containing the bit shift offset at which the matching octet occurs (+1, so 0 can indicate no match). You can initialize it like this:

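#include <stdint.h>
#include <string.h>   /* memset */

/* static array: a malloc() call is not a valid initializer at file scope in C */
static uint8_t g_pLookupTable[65536];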
void initLUT(uint8_t octet)
{
     memset(g_pLookupTable, 0, 65536); // zero out
     for(int i = 0; i < 65536; i++)
     {          
         for(int j = 7; j >= 0; j--)
         {
             if(((i >> j) & 255) == octet)
             {
                 g_pLookupTable[i] = j + 1;
                 break;
             }
         }
     }
}

Note that the case where the value is shifted 8 bits is not included (the reason will be obvious in a minute).

Then you can scan through your array of bytes like this:

 int findByteMatch(uint8_t* pArray, uint8_t octet, int length)
 {
     if(length > 0)   // with length == 0 there is no pArray[0] to read
     {
         uint16_t index = (uint16_t)pArray[0];
         if(index == octet)
             return 0;
         for(int bit, i = 1; i < length; i++)
         {
             index = (index << 8) | pArray[i];
             if((bit = g_pLookupTable[index]) != 0)
                 return (i * 8) - (bit - 1);
         }
     }
     return -1;
 }

Further optimization:

Read 32 bits (or however many) at a time from pArray into a uint32_t, then shift and AND to extract one byte at a time, OR each byte into index and test, before reading the next 4 bytes.
Pack the LUT into 32K by storing a nybble for each index. This might help it squeeze into the cache on some systems.
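
The nybble packing from the second point could look roughly like this. It is a sketch under assumptions: g_packed_lut, init_packed_lut and packed_lookup are hypothetical names, and even indices are assumed to live in the low nybble of each table byte. The stored values (0 for no match, 1 to 8 for a match) all fit in 4 bits.

```c
#include <stdint.h>
#include <string.h>

/* 32K table: two 4-bit entries per byte, one entry per 16-bit window value */
static uint8_t g_packed_lut[32768];

void init_packed_lut(uint8_t octet)
{
    memset(g_packed_lut, 0, sizeof g_packed_lut);
    for (uint32_t i = 0; i < 65536; i++) {
        for (int j = 7; j >= 0; j--) {
            if (((i >> j) & 255u) == octet) {
                uint8_t v = (uint8_t)(j + 1);          /* 1..8 */
                if (i & 1u)
                    g_packed_lut[i >> 1] |= (uint8_t)(v << 4);   /* odd index: high nybble */
                else
                    g_packed_lut[i >> 1] |= v;                   /* even index: low nybble */
                break;
            }
        }
    }
}

/* Returns shift + 1 for a match, or 0 if the octet does not occur. */
unsigned packed_lookup(uint16_t index)
{
    uint8_t b = g_packed_lut[index >> 1];
    return (index & 1u) ? (unsigned)(b >> 4) : (unsigned)(b & 0x0Fu);
}
```

The lookup costs an extra shift and mask per access, so whether the smaller table pays off would have to be measured.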

It will depend on your memory architecture whether this is faster than an unrolled loop that doesn't use a lookup table.
