C++ bitmask / conditional branch optimization

Question

I am trying to reduce the execution time of the if-statement shown below (second block of code). It involves a bit-mask where the masks array contain 8 integers used as masks and setup as follows:

static unsigned int masks[8];

void setupMasks() {
    int mask = 3; // 0000 0000 0000 0000 0000 0000 0000 0011
    for(unsigned int i=0; i < 8; i++) {
        masks[i] = (mask << (i * 4));
    }
}

Each integer in the testarr below actually contains 8 results. Each result is 4 bits of the 32-bit int and I only want to know if any of the lower-two out of the 4 bits is a 1. The code below is executed inside another for-loop that updates resultnum. failcount is a locally-defined int array. I would like to avoid masking, but the data in testarr comes from an API that I cannot change. In any case, I think the if-statement consumes more time than masking, but I could be wrong. Does anyone see a way to optimize?

for(unsigned int i = 0; i < 8 && dumped < numtodump; i++, dumped++) { //8 results per 32-bit value
    unsigned int fails = 0;
    for(unsigned int j = 0; j < 32; j++) {
    if((testarr[j * numintsperpin + resultnum] & masks[i]) && failcount[j]++ <= 10000) { //have a fail
            failingpins[fails++] = &pins[j];
        }
    }
}

Sorry if my previous post was not clear. Below is the full function. I tried to simplify the problem statement as much as possible earlier. Sorry if I left out useful details.

void process(vector<int> &testarr, vector<unsigned int> &failcount, vector<pin> &pins, vector<unsigned int> &muxaddr, unsigned int base, StopWatch &acc1) {
    unsigned int labeloffset = 400;
    unsigned int startindex = 50;
    unsigned int numtodump = 1000;
    unsigned int numintsperpin = testarr.size() / pins.size();
    pin** failingpins = new pin*[32];
    acc1.start();
    int count = 0;
    unsigned int dumped = 0;
    unsigned int resultnum = 0;
    while(dumped < numtodump) {
        for(unsigned int i = 0; i < 8 && dumped < numtodump; i++, dumped++) { //8 results per 32-bit value
            unsigned int currentindex = labeloffset + dumped + startindex;
            unsigned int fails = 0;
            for(unsigned int j = 0; j < pins.size(); j++) {
                if((testarr[j * numintsperpin + resultnum] & masks[i]) && failcount[j]++ <= 10000) { //have a fail
                    failingpins[fails++] = &pins[j];
                }
            }
            unsigned int saddr = muxaddr[currentindex];
            if(fails > 0) {             
                logFails(fails, muxaddr[currentindex] - base, failingpins);
            }
        }
        resultnum++;
    }
    acc1.accumulate();  
}

I have to admit, Im still trying to understand the whole point of what you are trying to solve! — trumpetlicks, Feb 26 '14 at 23:30
I don't really understand what your code is doing here. But it seems to me that the index of `testarr` is calculated irrelevant of `i`. So maybe you can swap the order of the two loops, i.e. iterate through `j` outside and `i` inside. You can reduce the time of calculation of `testarr[j * numintsperpin + resultnum]` to one-eighth by caching it in a temporary value, and use the temporary value during the should-be-inner loop which iterates through 8 bitmasks. — james, Feb 26 '14 at 23:34
Do you expect many more passes than failures? In that is the case you could start by first doing one test against a composite mask of 00110011001100110011001100110011 and if that passes you can skip the test against the 8 submasks. — Jens Agby, Feb 26 '14 at 23:41

score 0 · Answer 1 · answered Feb 27 '14 at 00:14

See if I have this right:

Each entry in testarr is a 32-bit value containing 8 x 4-bit fields

You want to know if any field has either of the lower 2 bits set, i.e. you want to mask each 32-bit value with:

0011 0011 0011 0011 0011 0011 0011 0011

so why not:

for( int i=0; i<testarr_length; i++ )
   if( testarr[i] & 0x33333333 )
      // do something !

If you need to know how many fields have these bits set, then

for( int i=0; i<testarr_length; i++ )
{
   unsigned int dword= testarr[i];
   for( int field=0; field<8; field++ )
   {
        if( dword & 0x3 )
            // do something
        dword= dword >> 4;
   }     
}

I actually need to know which specific 4-byte subset of the 32-bit integer is non-zero (although only the lower 2 bits could possibly be 1). In the loop above, I log the 4-byte subsets which are non-zero. — PentiumPro200, Feb 27 '14 at 01:34

iavr · Answer 2 · 2014-02-27T08:31:03.330

You can try the following

inline int count(int x)
{
    static int mask1 = 0x11111111;
    static int mask2 = 0x22222222;
    return __builtin_popcount(x & mask1 | x & mask2 << 1);
}

// ...

unsigned int fails = 0;
for(unsigned int j = 0; j < 32; j++) {
    int c = count(testarr[j * numintsperpin + resultnum]);
    if(c && (failcount[j]+=c) <= 10000) { //have a fail
        failingcols[fails+=c] = &column[j];
    }
}

where I have split the mask into two separate masks and I have used function __builtin_popcount that counts the bits of a given integer in just one CPU operation, thus avoiding the loop over i altogether.

__builtin_popcount should be provided by the compiler, e.g. the example above works with Clang and GCC with option -msse4.2. As far as I recall, the MS compiler respectively provides function __popcnt.

I don't know what the role of dumped is but it doesn't show inside your loop so I just ignored it.

EDIT

I have now seen the updated question where it appears that dumped plays a significant role in recoding the actual position of failures apart from their count. In this case my solution does not apply. This new problem is much more difficult to optimize.

C++ bitmask / conditional branch optimization

2 Answers2