0

I am trying to reduce the execution time of the if-statement shown below (second block of code). It involves a bit-mask where the masks array contain 8 integers used as masks and setup as follows:

static unsigned int masks[8];

void setupMasks() {
    int mask = 3; // 0000 0000 0000 0000 0000 0000 0000 0011
    for(unsigned int i=0; i < 8; i++) {
        masks[i] = (mask << (i * 4));
    }
}

Each integer in the testarr below actually contains 8 results. Each result is 4 bits of the 32-bit int and I only want to know if any of the lower-two out of the 4 bits is a 1. The code below is executed inside another for-loop that updates resultnum. failcount is a locally-defined int array. I would like to avoid masking, but the data in testarr comes from an API that I cannot change. In any case, I think the if-statement consumes more time than masking, but I could be wrong. Does anyone see a way to optimize?

for(unsigned int i = 0; i < 8 && dumped < numtodump; i++, dumped++) { //8 results per 32-bit value
    unsigned int fails = 0;
    for(unsigned int j = 0; j < 32; j++) {
    if((testarr[j * numintsperpin + resultnum] & masks[i]) && failcount[j]++ <= 10000) { //have a fail
            failingpins[fails++] = &pins[j];
        }
    }
}

Sorry if my previous post was not clear. Below is the full function. I tried to simplify the problem statement as much as possible earlier. Sorry if I left out useful details.

void process(vector<int> &testarr, vector<unsigned int> &failcount, vector<pin> &pins, vector<unsigned int> &muxaddr, unsigned int base, StopWatch &acc1) {
    unsigned int labeloffset = 400;
    unsigned int startindex = 50;
    unsigned int numtodump = 1000;
    unsigned int numintsperpin = testarr.size() / pins.size();
    pin** failingpins = new pin*[32];
    acc1.start();
    int count = 0;
    unsigned int dumped = 0;
    unsigned int resultnum = 0;
    while(dumped < numtodump) {
        for(unsigned int i = 0; i < 8 && dumped < numtodump; i++, dumped++) { //8 results per 32-bit value
            unsigned int currentindex = labeloffset + dumped + startindex;
            unsigned int fails = 0;
            for(unsigned int j = 0; j < pins.size(); j++) {
                if((testarr[j * numintsperpin + resultnum] & masks[i]) && failcount[j]++ <= 10000) { //have a fail
                    failingpins[fails++] = &pins[j];
                }
            }
            unsigned int saddr = muxaddr[currentindex];
            if(fails > 0) {             
                logFails(fails, muxaddr[currentindex] - base, failingpins);
            }
        }
        resultnum++;
    }
    acc1.accumulate();  
}
PentiumPro200
  • 641
  • 7
  • 23
  • I have to admit, Im still trying to understand the whole point of what you are trying to solve! – trumpetlicks Feb 26 '14 at 23:30
  • I don't really understand what your code is doing here. But it seems to me that the index of `testarr` is calculated irrelevant of `i`. So maybe you can swap the order of the two loops, i.e. iterate through `j` outside and `i` inside. You can reduce the time of calculation of `testarr[j * numintsperpin + resultnum]` to one-eighth by caching it in a temporary value, and use the temporary value during the should-be-inner loop which iterates through 8 bitmasks. – james Feb 26 '14 at 23:34
  • Do you expect many more passes than failures? In that is the case you could start by first doing one test against a composite mask of 00110011001100110011001100110011 and if that passes you can skip the test against the 8 submasks. – Jens Agby Feb 26 '14 at 23:41

2 Answers2

0

See if I have this right:

Each entry in testarr is a 32-bit value containing 8 x 4-bit fields

You want to know if any field has either of the lower 2 bits set, i.e. you want to mask each 32-bit value with:

0011 0011 0011 0011 0011 0011 0011 0011

so why not:

for( int i=0; i<testarr_length; i++ )
   if( testarr[i] & 0x33333333 )
      // do something !

If you need to know how many fields have these bits set, then

for( int i=0; i<testarr_length; i++ )
{
   unsigned int dword= testarr[i];
   for( int field=0; field<8; field++ )
   {
        if( dword & 0x3 )
            // do something
        dword= dword >> 4;
   }     
}
TonyWilk
  • 1,447
  • 10
  • 13
  • I actually need to know which specific 4-byte subset of the 32-bit integer is non-zero (although only the lower 2 bits could possibly be 1). In the loop above, I log the 4-byte subsets which are non-zero. – PentiumPro200 Feb 27 '14 at 01:34
0

You can try the following

inline int count(int x)
{
    static int mask1 = 0x11111111;
    static int mask2 = 0x22222222;
    return __builtin_popcount(x & mask1 | x & mask2 << 1);
}

// ...

unsigned int fails = 0;
for(unsigned int j = 0; j < 32; j++) {
    int c = count(testarr[j * numintsperpin + resultnum]);
    if(c && (failcount[j]+=c) <= 10000) { //have a fail
        failingcols[fails+=c] = &column[j];
    }
}

where I have split the mask into two separate masks and I have used function __builtin_popcount that counts the bits of a given integer in just one CPU operation, thus avoiding the loop over i altogether.

__builtin_popcount should be provided by the compiler, e.g. the example above works with Clang and GCC with option -msse4.2. As far as I recall, the MS compiler respectively provides function __popcnt.

I don't know what the role of dumped is but it doesn't show inside your loop so I just ignored it.

EDIT

I have now seen the updated question where it appears that dumped plays a significant role in recoding the actual position of failures apart from their count. In this case my solution does not apply. This new problem is much more difficult to optimize.

iavr
  • 7,547
  • 1
  • 18
  • 53