
I'm working on a bitboard-based chess engine, and one of the actions performed profusely is setting/clearing bits in an unsigned 64-bit integer. Since I'm not that well-versed in what code will run 'faster' on certain processors, this is something I can't quite wrap my head around.

Setting and clearing bits is quite a simple operation, but should I use (for setting a bit in a uint64_t bitboard):

bitboard |= 1ULL << index;

or:

bitboard |= BITMASK[index];

where BITMASK[] is some pre-calculated array of integers in which exactly one bit (at index) is set.

At first glance, bitshifting seems like the obvious faster choice, since a shift should always be faster than a memory lookup.

But in the context of a chess engine, where this operation will be performed abundantly, the lookup table will likely stay in the processor's cache, which may perhaps make the lookup faster. Or would it?

Moreover, does it even make a difference?

May perhaps be a silly consideration, but it doesn't hurt to know.
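
For concreteness, the two candidates could be wrapped like this (a quick sketch; the helper names and the table setup are only illustrative):

```cpp
#include <cstdint>

// Precomputed table: BITMASK[i] has exactly bit i set.
uint64_t BITMASK[64];

void init_bitmask()
{
    for (int i = 0; i < 64; i++) BITMASK[i] = 1ULL << i;
}

// Variant 1: compute the single-bit mask with a shift.
inline void set_bit_shift(uint64_t& bitboard, int index)
{
    bitboard |= 1ULL << index;
}

// Variant 2: fetch the precomputed mask from the table.
inline void set_bit_table(uint64_t& bitboard, int index)
{
    bitboard |= BITMASK[index];
}
```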

Shreyas
  • Pick a way, code it. If it seems slow, measure it, change it, measure it again. If it seems fast enough, don't worry about it. "Premature optimization is the root of all evil" :) – Mark Tolonen Nov 13 '15 at 23:25
  • Writing a simple function and doing several million loops should give you a good estimate of what is faster, and it takes only a matter of minutes to code. – Devolus Nov 13 '15 at 23:26
  • @Devolus Good idea, I'll get right on it. – Shreyas Nov 13 '15 at 23:27
  • Just make sure that your test code mimics the relevant characteristics of your intended usage to get more reliable results. But anyway, I bet that the shift will be faster. – Devolus Nov 13 '15 at 23:30
  • Shifting is a dedicated instruction for basically all architectures – qwr Nov 13 '15 at 23:32
  • Since you tagged this as C++, you should consider std::bitset. Maybe a 3rd style to test. In my embedded software, the team almost always used shift operations, even to mask a field in the middle. Also, I'm sure you know that you can not set a bit with an xor ('^='). – 2785528 Nov 13 '15 at 23:33
  • @DOUGLASO.MOEN Oops, I meant to use OR to set. – Shreyas Nov 13 '15 at 23:39
  • And use "bitboard &= ~( bit-pattern )" to clear – 2785528 Nov 14 '15 at 00:05
  • From what I see so far, lookup tables are faring consistently faster. Not by much, though. – Shreyas Nov 14 '15 at 00:07
  • Write your code with a macro (or inline function) everywhere you do this operation. Then when you have finished your program you can test the differences in a live scenario. – M.M Nov 14 '15 at 00:57
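
Putting the advice from these comments together, the full set of operations could be sketched like this (hedged sketch; the helper names are illustrative, not from the question):

```cpp
#include <cstdint>

inline void set_bit(uint64_t& bb, int index)    { bb |=   1ULL << index;  } // OR sets the bit
inline void clear_bit(uint64_t& bb, int index)  { bb &= ~(1ULL << index); } // AND with the inverted mask clears it
inline void toggle_bit(uint64_t& bb, int index) { bb ^=   1ULL << index;  } // XOR flips it (it does not reliably set or clear)
inline bool test_bit(uint64_t bb, int index)    { return (bb >> index) & 1ULL; }
```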

2 Answers


The shift method should be faster than the table lookup, as it avoids an extra memory reference. But for educational purposes it would be interesting to benchmark.

Sasha Pachev
  • But the processor will probably cache the lookup table, which would remove the overhead of a memory lookup! – Shreyas Nov 13 '15 at 23:31
  • CPU cache is still not as fast as a register, so we are talking about a cached memory dereference vs. a shift; a shift is an elementary operation that I would expect to take just one cycle on a modern CPU with very low disruption to the subsequent instructions. – Sasha Pachev Nov 13 '15 at 23:37

I quickly whipped up this (very crude, pardon) test program:

#include <iostream>
#include <ctime>   // std::clock(), CLOCKS_PER_SEC
#include <random>  // std::mt19937

typedef unsigned long long uint64;

uint64 SET_BITMASK[64];

void init_bitmask()
{
    for(int i = 0; i < 64; i++) SET_BITMASK[i] = 1ULL << i;
}

int main()
{
    std::mt19937 gen_rand(42);
    uint64 bb = 0ULL;
    double avg1 = 0.0, avg2 = 0.0; // running totals across the ten runs

    init_bitmask();

    for(unsigned int i = 0; i < 10; i++)
    {
        std::clock_t begin = std::clock();

        for(unsigned int j = 0; j < 99999999; j++)
        {
            bb |= 1ULL << (gen_rand() % 64);
        }

        std::clock_t end = std::clock();

        std::cout << "For bitshifts, it took: " << (double) (end - begin) / CLOCKS_PER_SEC << "s." << std::endl;
        avg1 += (double) (end - begin) / CLOCKS_PER_SEC;

        bb = 0ULL;

        begin = std::clock();

        for(unsigned int j = 0; j < 99999999; j++)
        {
            bb |= SET_BITMASK[gen_rand() % 64];
        }

        end = std::clock();

        std::cout << "For lookups, it took: " << (double) (end - begin) / CLOCKS_PER_SEC << "s." << std::endl << std::endl;
        avg2 += (double) (end - begin) / CLOCKS_PER_SEC;
    }

    std::cout << std::endl << std::endl << std::endl;

    std::cout << "For bitshifts, the average is: " << avg1 / 10 << "s." << std::endl;
    std::cout << "For lookups, the average is: " << avg2 / 10 << "s." << std::endl;
    std::cout << "Lookups are faster by " << (((avg1 / 10) - (avg2 / 10)) / (avg2 / 10))*100 << "%." << std::endl;
}

Averaged over ten runs of one hundred million bit sets each, the result is consistently 1.61603s for bitshifts and 1.57592s for lookups (even for different seed values).

Astonishingly, lookup tables seem consistently faster by roughly 2.5% (in this particular use case).

Note: I used random numbers to prevent any inconsistencies, as shown below.

If I use i % 64 to shift/index, bitshifting is faster by about 6%.

If I use a constant to shift/index, the output is varied by about 8%, between -4% and 4%, which makes me think that some funny guessing business is in play. Either that, or they average to 0% ;)

I cannot draw a conclusion since this is certainly not a real scenario, as even in a chess engine, these set bit cases won't follow each other in rapid succession. All I can say is that the difference is probably negligible. I can also add that lookup tables are inconsistent, as you are at the mercy of whether the tables have been cached. I'm personally going to use bitshifts in my engine.

Shreyas
  • The overhead of gen_rand() % 64 is huge compared to either a memory dereference or a binary shift. Because of the loop you are also hitting the issue of branch prediction, which could create some noise. Try doing it 10 times or more in the loop. The optimizer probably already translates % 64 as & 0x3f, but it might be good to do that explicitly. You can also try disabling all optimization options in the compiler. Make sure to disassemble the code for each case to see what is really happening. – Sasha Pachev Nov 14 '15 at 03:41
  • @SashaPachev I used random numbers specifically to prevent inconsistencies with branch prediction, although I can't tell how much it helped, if at all. I've done a thousand iterations too, with the same result. I also used a constant instead of gen_rand() % 64, which gives inconsistent results varying over 10% or so. I suspect that's some branch prediction silly business there. Although, I'm not going to go as far as to look at disassembled code. I've already taken M.M's advice and used inline functions, so I can test it in my particular case when my engine is finished. – Shreyas Nov 14 '15 at 13:29
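
For the record, the inline wrapper mentioned in these comments might look something like this (a sketch; set_bit is an illustrative name, and & 63 spells out the % 64 reduction that the optimizer is expected to do anyway):

```cpp
#include <cstdint>

// One central definition: swap the body between the shift and the
// table lookup later, and re-measure inside the finished engine
// without touching any call sites.
inline void set_bit(uint64_t& bitboard, unsigned index)
{
    bitboard |= 1ULL << (index & 63); // & 63 is the explicit form of % 64
}
```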