Fastest way to iterate over bits

Question

I have been working on a chess engine for a while and it's fairly strong (about 2700 CCRL), and I was wondering about the following:

Most top-level chess engines use bitboards. They are basically just 64-bit numbers which represent some binary data, such as whether a square is occupied or not. Bitboards come in handy in many applications, but in many scenarios I need to iterate over all the set bits and use the index of each one.

Some functions I defined beforehand:

typedef uint64_t U64;
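typedef int8_t Square;  // it's signed, for various reasons

/**
 * Returns the index of the least significant set bit (LSB).
 * The result is undefined for bb == 0.
 * @param bb  a non-zero bitboard
 * @return    index of the LSB (0..63)
 */
inline Square bitscanForward(U64 bb) {
//    assert(bb != 0);
    return __builtin_ctzll(bb);
}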


/**
 * Clears the least significant set bit of the given number and returns the result.
 * @param number  the bitboard to modify
 * @return        number with its LSB cleared
 */
inline U64 lsbReset(U64 number) {
    return number & (number - 1);
}

Using those two, I usually iterated bitboards like this:

U64 bb = ....

while(bb){
   Square s = bitscanForward(bb);
   ...
   bb = lsbReset(bb);
}
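For reference, the loop above can be put into a self-contained form. This is only a minimal sketch, assuming GCC or Clang (because of `__builtin_ctzll`); the helper `squaresOf` is a hypothetical name added for illustration and is not part of my engine. With C++20, `std::countr_zero` from `<bit>` would be a standard alternative to the builtin.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t U64;
typedef int8_t Square;

// Index of the least significant set bit; the builtin is undefined for 0.
inline Square bitscanForward(U64 bb) {
    assert(bb != 0);
    return static_cast<Square>(__builtin_ctzll(bb));
}

// Clears the least significant set bit.
inline U64 lsbReset(U64 number) {
    return number & (number - 1);
}

// Hypothetical helper: collects the indices of all set bits, lowest first.
inline std::vector<Square> squaresOf(U64 bb) {
    std::vector<Square> out;
    while (bb) {
        out.push_back(bitscanForward(bb));
        bb = lsbReset(bb);
    }
    return out;
}
```

The loop body runs once per set bit, so the cost scales with the population count of the bitboard rather than with its 64-bit width.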

An alternative solution was:

for(;bb!=0;bb=lsbReset(bb)){
   Square s = bitscanForward(bb);
   ...
}

Yet the second one turns out to be slower (for some reason).

My questions therefore are:

Why is the second approach slower?
Are there even faster approaches? Especially in terms of reducing clock cycles by adjusting some computations.

EDIT 1

As requested, here is my testing code. In fact, I found a small mistake, and both versions run at the same speed. The question remains whether there are better implementations.

    U64 k = randU64();
    printBitmap(k);

    startMeasure();
    int sum = 0;
    for(int i = 0; i < 1e8; i++){
        U64 copy = k;
        while(copy){
            Square s = bitscanForward(copy);
            sum += s;

            copy = lsbReset(copy);
        }
    }
    std::cout << stopMeasure() << "ms sum=" << sum << std::endl;

    startMeasure();
    sum = 0;
    for(int i = 0; i < 1e8; i++){
        U64 copy = k;
        for(;copy!=0;copy=lsbReset(copy)){
            Square s = bitscanForward(copy);
            sum += s;
        }
    }
    std::cout << stopMeasure() << "ms sum=" << sum << std::endl;

Which outputs:

10101001
01011111
00011111
00000111
01010001
00000011
00110001
10010100

1748ms sum=-1174182400
1757ms sum=-1174182400
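The negative sums in the output above show that the `int` accumulator overflowed, which is undefined behaviour for signed integers in C++. A hedged sketch of the same measurement with a 64-bit accumulator and `std::chrono` follows; `sumBitIndices` and `elapsedMs` are hypothetical names chosen for illustration, and they stand in for my own `startMeasure`/`stopMeasure` helpers, which are not shown here.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

typedef uint64_t U64;

// Sums the indices of the set bits of bb, repeated `rounds` times.
// The 64-bit accumulator does not overflow at 1e8 rounds
// (at most 2016 is added per round for a full bitboard).
inline int64_t sumBitIndices(U64 bb, int rounds) {
    assert(rounds >= 0);
    int64_t sum = 0;
    for (int i = 0; i < rounds; i++) {
        U64 copy = bb;
        while (copy) {
            sum += __builtin_ctzll(copy);  // index of the lowest set bit
            copy &= copy - 1;              // clear the lowest set bit
        }
    }
    return sum;
}

// Hypothetical timing wrapper: runs `work` once and returns milliseconds.
template <class F>
double elapsedMs(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as `elapsedMs([&]{ total = sumBitIndices(k, 100000000); })` followed by printing `total` keeps the result observable. Since the input is the same in every round, an optimizing compiler may still simplify the outer loop, so timings from a benchmark like this are only a rough indication.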

Why not just store your board in `vector<vector<bool>>` or `bitset<64>`? It is space-optimised and will take the same space as your bitboards; moreover, it will be easy to iterate over. Bitsets are faster than arrays, and also memory-efficient. The only difference is that with a bitset you'll need to access `[i*8 + j]` instead of `[i][j]`. — brc-dd, Jul 21 '20 at 17:05
Show us how you are measuring, what compiler options you are using to build your application such as optimization levels, etc. — PaulMcKenzie, Jul 21 '20 at 17:05
@brc-dd I am fairly sure that this is not space-efficient, especially because a single bool will be 8 bits. — Finn Eggers, Jul 21 '20 at 17:06
@FinnEggers There is a separate implementation for [`vector<bool>`](https://en.cppreference.com/w/cpp/container/vector_bool). Well, being sure and being correct are two different things. :) — brc-dd, Jul 21 '20 at 17:07
`return __builtin_ctzll(bb);` - Also, you should tag what compiler you're using. This is not a standard function. — PaulMcKenzie, Jul 21 '20 at 17:09
`vector<bool>` is space-efficient, but will not be the fastest way to do this. Plus it's a bit weird. — user4581301, Jul 21 '20 at 17:12
How space efficient do you need to be? I would think working in bits is not going to be faster than working in bytes. — Galik, Jul 21 '20 at 17:14
If you want speed, don't use bits, use an array or `std::vector`. With most processors, accessing integers in memory is a lot faster than having to do bit manipulations. To test a bit, you have to create the mask (usually using bit shifting), apply the mask to isolate the bit, then test for zero/nonzero. The integer version is: read memory, test for zero/nonzero. Fewer instructions for the integer version. — Thomas Matthews, Jul 21 '20 at 17:17
Yes, I have, but using any sort of encoding, the entries for a sparse matrix will take up a lot of space. — Finn Eggers, Jul 21 '20 at 17:17
Search the internet for "Time Memory tradeoff". You're at the point where you are sacrificing time for compact space. You can't have both. — Thomas Matthews, Jul 21 '20 at 17:18
There are multiple advantages of bitboards. Especially because there are operations where they are multiplied with other bitboards (yeah this is sort of crazy) — Finn Eggers, Jul 21 '20 at 17:18
The suggestions for other data representations do not take into account that this is only one aspect of what a chess engine will do, for a lot of other things you do want the bitboard. — harold, Jul 21 '20 at 17:18
I am especially curious if there is a better way to iterate the bits than the implementation I have got — Finn Eggers, Jul 21 '20 at 17:19
These two solutions are the exact same solution! One shouldn't be slower than the other. A 9ms difference out of 1757ms might just be random noise (maybe your NTP daemon decided to update the time or something) — user253751, Jul 21 '20 at 17:23
@harold & FinnEggers I had also suggested `std::bitset<64>` for this question instead of `uint64_t`. Do you have an argument against that too? [Comparison with `vector<bool>`](https://cs.up.ac.za/cs/vpieterse/pub/PieterseEtAl_SAICSIT2010.pdf). PS: In the mentioned paper, they have used `boost::dynamic_bitset` instead of `std::bitset`. But using `std::bitset` makes it work even faster, and the size is fixed at 64 bits. It's also much easier to implement than what the OP has implemented so far. — brc-dd, Jul 21 '20 at 17:32
Bitboards also require some additional features like: shifting, binary operations, multiplication with other bitboards. — Finn Eggers, Jul 21 '20 at 17:39
@brc-dd it's a decent suggestion, but iteration would not be sparse and would cost a mostly-unpredictable branch per bit. It's also not very friendly to various non-trivial bitboard operations (for example transposing, or the `o^(o-2r)` trick). — harold, Jul 21 '20 at 17:44
https://graphics.stanford.edu/~seander/bithacks.html is always worth exploring. You could try a precomputed lookup table. For example, you can map all 2^16 combinations on 16 bits to an array of up to 16 indices of occupied squares. Then you would use this table four times, for the four 16-square chunks in a U64. But I doubt it will be faster. It also depends on how sparse your boards are. For endgame tables, where only a few bits are set, your method is by far the fastest. — Cătălin Frâncu, Jul 24 '20 at 08:16
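The lookup-table idea from the last comment could be sketched roughly as follows. This is an illustration under stated assumptions, not a measured improvement: `ChunkEntry`, `buildChunkTable` and `squaresViaTable` are hypothetical names, and the table takes roughly 1 MiB (65536 entries of 17 bytes), which may cost more in cache misses than it saves.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t U64;

// One table entry: the set-bit positions (0..15) of a 16-bit chunk.
struct ChunkEntry {
    uint8_t count;
    uint8_t index[16];
};

// Hypothetical builder: precomputes the entry for every 16-bit value.
inline std::vector<ChunkEntry> buildChunkTable() {
    std::vector<ChunkEntry> table(1u << 16);  // entries start zeroed
    for (uint32_t v = 0; v < (1u << 16); v++) {
        ChunkEntry& e = table[v];
        for (uint8_t bit = 0; bit < 16; bit++) {
            if (v & (1u << bit)) e.index[e.count++] = bit;
        }
    }
    return table;
}

// Writes the square indices of bb into out (room for 64 entries needed),
// lowest first, and returns how many were written.
inline int squaresViaTable(const std::vector<ChunkEntry>& table,
                           U64 bb, uint8_t* out) {
    assert(table.size() == (1u << 16));
    int n = 0;
    for (int chunk = 0; chunk < 4; chunk++) {
        const ChunkEntry& e = table[(bb >> (16 * chunk)) & 0xFFFF];
        for (int i = 0; i < e.count; i++) {
            out[n++] = static_cast<uint8_t>(16 * chunk + e.index[i]);
        }
    }
    return n;
}
```

Unlike the bitscan loop, this version always performs four table lookups, even for an empty bitboard, which fits the comment's remark that the original method is likely the better choice for sparse boards.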
