
I have an algorithm for simulations on supercomputers that needs a lot of bit manipulation. Some operations will need masks, and in particular a function like this:

template <typename Type,
          class = typename std::enable_if<std::is_integral<Type>::value>::type,
          class = typename std::enable_if<std::is_unsigned<Type>::value>::type>
inline Type mask(const std::size_t first, const std::size_t last)
{
     // Something
}

that generates a mask of type `Type` where the bits in the range `[first, last)` are set to 1 (`first` and `last` are runtime variables).

For example:

mask<unsigned char>(3, 6) -> 00111000

I will need hundreds of billions of these masks, so I need this function to be as optimized as possible (but in plain standard C++11). How can I do that?

Vincent
  • If `first` and `last` are known at compile time, you can make them template parameters and force the compiler to do the busywork once and for all. It may already do that on its own though (inlining + constant-folding can work wonders on small functions, and this function will presumably be *very* small). Check the assembly. –  Jan 21 '14 at 22:46
  • Is `(1 << last) - (1 << first)` too slow? – Ivan Vergiliev Jan 21 '14 at 22:47
  • @delnan: thank you for the suggestion, but I already had compile-time versions of the functions to do the job when first and last are already known. – Vincent Jan 21 '14 at 23:00
  • You know it's a real supercomputer when the bit positions in a variable need to be expressed in a `size_t` because an `int` is just too small. – MSalters Jan 22 '14 at 00:13
  • @MSalters I'm glad I wasn't drinking anything when I read your comment, my monitor would be all wet now. – Mark Ransom Jan 22 '14 at 00:20
  • @MSalters: Hahhaha I agree with Mark, that comment made my day. – user541686 Jan 23 '14 at 10:52

4 Answers

return (1 << last) - (1 << first);
Mark Ransom

You could make a lookup table and the cost would be a single memory read. If you're using 32-bit elements, the table only needs to be 32x32 = 1024 words in memory (4 kilobytes), so it'll stay in cache if you're using it heavily. Even for 64-bit elements, a 64x64 lookup is only 4096 words (32 kilobytes).
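A minimal sketch of that idea for 32-bit elements (my own illustration, not from the answer; the 33x33 bound is an assumption so that `first` and `last` can both reach 32, slightly larger than the 32x32 figure above):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the lookup-table idea: table[first][last] holds the 32-bit
// mask with bits [first, last) set. Built once, then every mask is a
// single memory read.
struct MaskTable
{
    std::uint32_t table[33][33]; // 33 so first/last can both equal 32

    MaskTable()
    {
        for (std::size_t first = 0; first <= 32; ++first)
            for (std::size_t last = 0; last <= 32; ++last) {
                std::uint32_t m = 0;
                for (std::size_t b = first; b < last && b < 32; ++b)
                    m |= std::uint32_t(1) << b; // set each bit in [first, last)
                table[first][last] = m;
            }
    }

    std::uint32_t mask(std::size_t first, std::size_t last) const
    {
        return table[first][last]; // the single memory read
    }
};
```

Building the table bit by bit at startup sidesteps the shift-by-width pitfall entirely, since the loop never shifts by 32.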

StilesCrisis
  • 4 KiB sounds okay, but 32 KiB seems a bit hefty. I don't know how much L1 cache this supercomputer has, but for a recent x86 core that's the *entire* L1 dcache. If the simulation needs any other data (at least the integers to bit manipulate!) I could see this hurting performance more than a few integer operations to calculate the value on demand. –  Jan 21 '14 at 22:52
  • Coming to think of it, I'm not even sure whether this will be faster if the entire table is in a magical free bonus L1 cache. The index calculation (shift, add) and memory fetch takes as many instructions (presumably all single-cycle) as the `(1 << last) - (1 << first)` variant, with less potential for parallelism. Assuming a barrel shifter, but that's not exactly an exotic feature. –  Jan 21 '14 at 22:55
  • On a Power-like architecture, you're probably right. On Intel, I'd bet the lookup is fewer instructions. Once it gets into microcode format it may well be a wash again, however. – StilesCrisis Jan 21 '14 at 23:09
  • I should have been clearer, I was talking about cycles primarily and less about instruction counts (those are a means to an end). A single CISC instruction which does the shift and add combined presumably wouldn't save any clock cycles on hardware from this century, at least latency wise. Unless those CPUs have special circuitry for this specific pattern, which would be really simple at the wire level, just very narrow-special-purpose. –  Jan 21 '14 at 23:15
  • The `LEA` opcode is pretty heavily used for exactly this sort of thing. I wouldn't be surprised if they had silicon dedicated to making `LEA` run quick. – StilesCrisis Jan 22 '14 at 01:38
  • I thought of `LEA` but it seems to me that it doesn't support a scale of five or six bits, at least not in a single cycle/instruction. It can only shift by 1, 2, 4 or 8. Here, the "specific pattern" is "shift by *any* amount, and add". –  Jan 22 '14 at 08:56

This is an extract from the standard:

Shift operators

[expr.shift]

... The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.

That's why the expression `(1 << last) - (1 << first)` does not work when `last == sizeof(Type)*CHAR_BIT`. I propose an alternative that computes the value at compile time when possible. See the following example:

#include <limits>
#include <iostream>
#include <bitset>


template <class Integer>
constexpr Integer ones()
{
    return ~static_cast<Integer>(0);
}


template <class Integer>
constexpr Integer mask(std::size_t first, std::size_t last)
{
    return (ones<Integer>() << first) &
           (ones<Integer>() >> (std::numeric_limits<Integer>::digits - last));
}


//Requires: first is in [0,8) and last is in (0,8]
void print8(std::size_t first, std::size_t last)
{
    std::cout << std::bitset<8>(mask<unsigned char>(first, last)) << '\n';
}

int main()
{
    print8(0,1); //00000001
    print8(2,6); //00111100
    print8(0,8); //11111111
    print8(2,2); //00000000

    static_assert(mask<unsigned char>(0,8) == 255,
                  "It should work at compile-time when possible");
}
dieram3

Maybe just a small change to reflect the meaning (as I understand it) of `first` and `last` in the example given by the OP:

#include <iostream>
#include <bitset>
using namespace std;

unsigned char mask( int first, int last) {
    return (1 << ((8 - first) + 1)) - (1 << (8 - last));
}
int main(int argc, char** argv) {
    cout << bitset<8>(mask(3,6)) << endl; //prints:00111100
    cout << bitset<8>(mask(2,6)) << endl; //prints:01111100
    cout << bitset<8>(mask(1,3)) << endl; //prints:11100000
    cout << bitset<8>(mask(1,7)) << endl; //prints:11111110
    return 0;
}
4pie0