4

Suppose we have an std::vector, or any other sequence container (sometimes it will be a deque), which store uint64_t elements.

Now, let's see this vector as a sequence of size() * 64 contiguous bits. I need to find the word formed by the bits in a given [begin, end) range, given that end - begin <= 64 so it fits in a word.

The solution I have right now finds the two words whose parts will form the result, and separately masks and combines them. Since I need this to be as efficient as possible, I've tried to code everything without any if branch to not cause branch mispredictions, so for example the code works in both cases when the entire range fits into a word or when it spans two words, without taking different paths. To do this I needed to code those shiftl and shiftr functions, which do nothing but shifting a word by the specified amount, like the >> and << operators, but gracefully handling the case when n is greater than 64, which would be undefined behavior otherwise.

Another point is that the get() function, as coded now, works also for empty ranges, in a mathematical sense, e.g. not only if begin == end, but also if begin > end, which is required by the main algorithm that is calling this function. Again, I've tried to do this without simply branching and returning zero in that case.

However, also looking at the assembly code, all this seems far too complex to perform such a seemingly simple task. This code runs in a performance-critical algorithm, which is running a bit too slow. valgrind told us this function is called 230 million times and accounts for the 40% of the total execution time, so I would really need to make it faster.

So can you help me to find a simpler and/or more efficient way to accomplish this task? I don't care too much about portability. Solutions using x86 SIMD intrinsics (SSE3/4/AVX ecc...) or compiler builtins are ok, as far as I can compile them with both g++ and clang.

My current code is included below:

using word_type = uint64_t;
const size_t W = 64;

// Shift right, but without being undefined behaviour if n >= 64
word_type shiftr(word_type val, size_t n)
{
    uint64_t good = n < W;

    return good * (val >> (n * good));
}

// Shift left, but without being undefined behaviour if n >= 64
word_type shiftl(word_type val, size_t n)
{
    uint64_t good = n < W;

    return good * (val << (n * good));
}

// Mask the word preserving only the lower n bits.
word_type lowbits(word_type val, size_t n)
{
    word_type mask = shiftr(word_type(-1), W - n);

    return val & mask;
}

// Struct for return values of locate()
struct range_location_t {
    size_t lindex; // The word where is located the 'begin' position
    size_t hindex; // The word where is located the 'end' position
    size_t lbegin; // The position of 'begin' into its word
    size_t llen;   // The length of the lower part of the word
    size_t hlen;   // The length of the higher part of the word
};

// Locate the one or two words that will make up the result
range_location_t locate(size_t begin, size_t end)
{
    size_t lindex = begin / W;
    size_t hindex = end / W;
    size_t lbegin = begin % W;
    size_t hend   = end % W;

    size_t len = (end - begin) * size_t(begin <= end);
    size_t hlen = hend * (hindex > lindex);
    size_t llen = len - hlen;

    return { lindex, hindex, lbegin, llen, hlen };
}

// Main function.
template<typename Container>
word_type get(Container const&container, size_t begin, size_t end)
{
    assert(begin < container.size() * W);
    assert(end <= container.size() * W);

    range_location_t loc = locate(begin, end);

    word_type low = lowbits(container[loc.lindex] >> loc.lbegin, loc.llen);

    word_type high = shiftl(lowbits(container[loc.hindex], loc.hlen), loc.llen);

    return high | low;
}

Thank you very much.

gigabytes
  • 3,104
  • 19
  • 35
  • What about using a [`std::bitset`](http://en.cppreference.com/w/cpp/utility/bitset) as intermediary? – πάντα ῥεῖ Dec 23 '14 at 09:59
  • You can't assume that 'good' is either 0 or 1 when you do when returning 'good*...' : that's non portable. Use ternary operator instead: 'return good?...:0;' – Christophe Dec 23 '14 at 10:11
  • 3
    @Christophe Perfectly portable in C++. `<` results in a `bool`, and conversion from that to any other integer yields only 0 or 1. See C++11, 4.7/4. – Sebastian Redl Dec 23 '14 at 11:06
  • Did you actually profile a simple, branching version of this code? – Sebastian Redl Dec 23 '14 at 11:16
  • 2
    Also, did you run cachegrind to check that the problem isn't lots of cache misses in this function (if the vector is large and the access pattern is very random), instead of actual CPU time? Because in that case, you need to optimize the usage pattern of this function, not the function itself. – Sebastian Redl Dec 23 '14 at 11:21
  • @πάνταῥεῖ, How could I use `std::bitset`? It doesn't seem to support extracting a subrange. @SebastianRedl, yes, a previous version of this code contained a branch to test if the range was contained into a single word or spanned two of them. These two cases happen randomly in the algorithm, even if the first is more probable. By eliminating it I've cut the total branch mispredictions from 15% to 4%. I've not tested the branches in the shiftl/shiftr functions. Also, cache misses seem to not be a problem. – gigabytes Dec 23 '14 at 11:23
  • In assembly you could do it with `shld \ shr` with some stuff around it to calculate shift amounts and such but basically that's it. But as far as I know there's no intrinsic for `shld`. Are you OK with assembly? – harold Dec 23 '14 at 11:38
  • @gigabytes You still have a bug in your code: if `end == size*W`, you access `container[size]`. You don't do anything with that value, but the access itself should assert in debug mode and is undefined in release. Which means your algorithm can never access the last bit. – Sebastian Redl Dec 23 '14 at 11:48
  • Different thing: which valgrind module told you that you spend 40% of your time in this function? I would trust a real profiler (gprof or oprofile) more for such information. – Sebastian Redl Dec 23 '14 at 12:23
  • It might be faster (and definitely simpler) to always get a 64 bit sequence first, and shift it at the end if you need a smaller part. – Kicsi Dec 23 '14 at 12:47
  • @SebastianRedl I used callgrind. Never used gprof, have to look into it. Better gprof or oprofile? – gigabytes Dec 23 '14 at 14:20
  • Why do you need shifts larger than 64 ? –  Dec 23 '14 at 14:28

2 Answers2

4

This replaces get() and all the helper functions used by get(). It contains a conditional branch and saves about 16 arithmetic operations, which means it should usually run faster. Compiled with some optimization, it also produces very short code. Finally, it solves the bug which caused container[container.size()] to be accessed in case end==container.size()*W.

The most tricky part is "hi-(hi>0)" which subtracts 1 from hi except if hi was 0. Subtracting 1 does not change anything except if hi points just to a word boundary, that is, hi%64==0. In that case we need 0 bits from the upper container entry, thus using just the lower container entry suffices. By subtracting 1 before computing hi_off we ensure the condition "hi_off==lo_off", and we run into the simpler case.

In that simpler case we need just one container entry and cut away some bits on both sides. hi_val is that entry, and the upper bits are already cut away, thus the only thing left to do is deleting some lower bits.

In the less simple case we have to read the lower container entry as well, strip the unused bytes of it, and combine both entries.

namespace {
  size_t   const upper_mask = ~(size_t)0u << 6u;
  unsigned const lower_mask = (unsigned)~upper_mask;
}

word_type get ( Container const &container, size_t lo, size_t hi )
{
  size_t lo_off = lo       >>6u;  assert ( lo_off < container.size() );
  size_t hi_off = hi-(hi>0)>>6u;  assert ( hi_off < container.size() );
  unsigned hi_shift = lower_mask&(unsigned)(upper_mask-hi);
  word_type hi_val = container[hi_off] << hi_shift >> hi_shift;
  unsigned lo_shift = lower_mask&(unsigned)lo;
  if ( hi_off == lo_off ) return hi_val >> lo_shift; // use hi_val as lower word
  return ( hi_val<<W-lo_shift | container[lo_off]>>lo_shift ) * (lo_off<hi_off);
}
Hans Klünder
  • 2,176
  • 12
  • 8
  • Very interesting! But I still have to understand it very well... Why do you shift by 6? – gigabytes Dec 24 '14 at 00:41
  • >>6u does the same as /W - it divides by 64, computing the index of the word that contains the desired bit. Usually the compiler's optimizer should see that anyway, but just in case it fails, I tell it how to divide by 64 quickly. By the way, you will notice that I do not use a special shift function that works well for shifting by 64 bits or more. That's because in my implementations, all shift operations shift by less than 64 bits. – Hans Klünder Dec 24 '14 at 08:52
  • Yeah, that shift was obvious, sorry... I don't think it's needed in any reasonable compiler, anyway. So, you use `lower_mask` to have the number modulo 64 instead of using the `%` operator, right? But why do you do `upper_mask - hi`? – gigabytes Dec 24 '14 at 10:59
  • (upper_mask-hi)%64 is the same as (64-hi%64)%64, just faster. hi_shift is the number of bits to be cut away on the left side of the higher word. Usually, it's the same as 64-hi%64, but not if hi is a multiple of 64. In that case, 64-hi%64 is 64, and hi_off is the same as lo_off, thus hi_val will be used as the lower word and I want to cut off nothing on the left side, thus hi_shift should be 0. I achieve that by first subtracting hi from the highest possible multiple of 64 and then applying %64 to the result. It's just the fastest way to make hi_shift have the value I need in all cases. – Hans Klünder Dec 24 '14 at 11:39
  • I just tested the speed. My proposal needs less than half of the time compared to the original version, both using std::vector and compiled using g++ with -O2, but my proposal is hard to read. The key is to understand the two possible cases, the fact that hi_val is the same as the lower word in the simpler case, and the way this is handled if hi is a multiple of 64 without adding another special check for that case. The less simple case, hi_off!=lo_off, works more straightforward. The simple case, hi_off==lo_off, works fine even if hi is a multiple of 64, and that's not quite obvious. – Hans Klünder Dec 24 '14 at 11:57
  • It's not obvious at all, but is very clever! If I can replicate the results of your benchmark in my code this is great! Thank you very much. – gigabytes Dec 24 '14 at 12:03
  • The board starts to complain for the long discussions in the comments so this is the last question: Why do you use "unsigned" as a type rather than ensure the right bitsize, especially in the variables used as a mask? On my platform unsigned is uint32_t, was exactly your intention to use a 32bit unsigned integer here? – gigabytes Dec 24 '14 at 12:42
  • Glad that the code helps! I use unsigned for small numbers because it's considered the natural size for the processor, thus I hope computations are fastest in unsigned. But I may be wrong. If there are any more questions, I'll click on the link "automatically move this discussion to chat" - this is what stackoverflow suggests me to do. I never did that before, but I hope the discussion can still be found then, and the board will be happy. – Hans Klünder Dec 24 '14 at 12:56
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/67628/discussion-between-gigabytes-and-hans-klunder). – gigabytes Dec 24 '14 at 15:22
3

As announced in the chat, I add a refined answer. It contains three parts, each of them followed by a description of that part.

The 1st part, get.h, is my solution, but generalized and with one correction.

The 2nd part, got.h, is the original algorithm as posted in the question, generalized as well to work with any STL container of any unsigned type.

The 3rd part, main.cpp, contains unit tests which verify the correctness and measure performance.

#include <cstddef>

using std::size_t;

template < typename C >
typename C::value_type get ( C const &container, size_t lo, size_t hi )
{

   typedef typename C::value_type item; // a container entry
   static unsigned const bits = (unsigned)sizeof(item)*8u; // bits in an item
   static size_t const mask = ~(size_t)0u/bits*bits; // huge multiple of bits

   // everthing above has been computed at compile time. Now do some work:

   size_t lo_adr = (lo       ) / bits; // the index in the container of ...
   size_t hi_adr = (hi-(hi>0)) / bits; // ... the lower or higher item needed

   // we read container[hi_adr] first and possibly delete the highest bits:

   unsigned hi_shift = (unsigned)(mask-hi)%bits;
   item hi_val = container[hi_adr] << hi_shift >> hi_shift;

   // if all bits are in the same item, we delete the lower bits and are done:

   unsigned lo_shift = (unsigned)lo%bits;
   if ( hi_adr <= lo_adr ) return (hi_val>>lo_shift) * (lo<hi);

   // else we have to read the lower item as well, and combine both

   return ( hi_val<<bits-lo_shift | container[lo_adr]>>lo_shift );

}

The 1st part, get.h above, is my original solution, but generalized to work with any STL containers of unsigned integer types. Thus you can use and test it for 32-bit integers or 128-bit integers as well. I still use unsigned for very small numbers, but you may as well replace them by size_t. The algorithm is nearly unchanged, with a small correction - if lo was the total number of bits in the container, my previous get() would access an item just above the container size. This is fixed now.

#include <cstddef>

using std::size_t;

// Shift right, but without being undefined behaviour if n >= 64
template < typename val_type >
val_type shiftr(val_type val, size_t n)
{
   val_type good = n < sizeof(val_type)*8;
   return good * (val >> (n * good));
}

// Shift left, but without being undefined behaviour if n >= 64
template < typename val_type >
val_type shiftl(val_type val, size_t n)
{
   val_type good = n < sizeof(val_type)*8;
   return good * (val << (n * good));
}

// Mask the word preserving only the lower n bits.
template < typename val_type >
val_type lowbits(val_type val, size_t n)
{
    val_type mask = shiftr<val_type>((val_type)(-1), sizeof(val_type)*8 - n);
    return val & mask;
}

// Struct for return values of locate()
struct range_location_t {
   size_t lindex; // The word where is located the 'begin' position
   size_t hindex; // The word where is located the 'end' position
   size_t lbegin; // The position of 'begin' into its word
   size_t llen;   // The length of the lower part of the word
   size_t hlen;   // The length of the higher part of the word
};

// Locate the one or two words that will make up the result
template < typename val_type >
range_location_t locate(size_t begin, size_t end)
{
   size_t lindex = begin / (sizeof(val_type)*8);
   size_t hindex = end / (sizeof(val_type)*8);
   size_t lbegin = begin % (sizeof(val_type)*8);
   size_t hend   = end % (sizeof(val_type)*8);

   size_t len = (end - begin) * size_t(begin <= end);
   size_t hlen = hend * (hindex > lindex);
   size_t llen = len - hlen;

   range_location_t l = { lindex, hindex, lbegin, llen, hlen };
   return l;
}

// Main function.
template < typename C >
typename C::value_type got ( C const&container, size_t begin, size_t end )
{
   typedef typename C::value_type val_type;
   range_location_t loc = locate<val_type>(begin, end);
   val_type low = lowbits<val_type>(container[loc.lindex] >> loc.lbegin, loc.llen);
   val_type high = shiftl<val_type>(lowbits<val_type>(container[loc.hindex], loc.hlen), loc.llen);
   return high | low;
}

This 2nd part, got.h above, is the original algorithm in the question, generalized as well to accept any STL containers of any unsigned integer types. Like get.h, this version does not use any definitions except the single template parameter that defines the container type, thus it can easily be tested for other item sizes or container types.

#include <vector>
#include <cstddef>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "get.h"
#include "got.h"

template < typename Container > class Test {

   typedef typename Container::value_type val_type;
   typedef val_type (*fun_type) ( Container const &, size_t, size_t );
   typedef void (Test::*fun_test) ( unsigned, unsigned );
   static unsigned const total_bits = 256; // number of bits in the container
   static unsigned const entry_bits = (unsigned)sizeof(val_type)*8u;

   Container _container;
   fun_type _function;
   bool _failed;

   void get_value ( unsigned lo, unsigned hi ) {
      _function(_container,lo,hi); // we call this several times ...
      _function(_container,lo,hi); // ... because we measure ...
      _function(_container,lo,hi); // ... the performance ...
      _function(_container,lo,hi); // ... of _function, ....
      _function(_container,lo,hi); // ... not the performance ...
      _function(_container,lo,hi); // ... of get_value and ...
      _function(_container,lo,hi); // ... of the loop that ...
      _function(_container,lo,hi); // ... calls get_value.
   }

   void verify ( unsigned lo, unsigned hi ) {
      val_type value = _function(_container,lo,hi);
      if ( lo < hi ) {
         for ( unsigned i=lo; i<hi; i++ ) {
            val_type val = _container[i/entry_bits] >> i%entry_bits & 1u;
            if ( val != (value&1u) ) {
               printf("lo=%d hi=%d [%d] is'nt %d\n",lo,hi,i,(unsigned)val);
               _failed = true;
            }
            value >>= 1u;
         }
      }
      if ( value ) {
         printf("lo=%d hi=%d value contains high bits set to 1\n",lo,hi);
         _failed = true;
      }
   }

   void run ( fun_test fun ) {
      for ( unsigned lo=0; lo<total_bits; lo++ ) {
         unsigned h0 = 0;
         if ( lo > entry_bits ) h0 = lo - (entry_bits+1);
         unsigned h1 = lo+64;
         if ( h1 > total_bits ) h1 = total_bits;
         for ( unsigned hi=h0; hi<=h1; hi++ ) {
            (this->*fun)(lo,hi);
         }
      }
   }

   static uint64_t time_used ( ) {
      struct rusage ru;
      getrusage(RUSAGE_THREAD,&ru);
      struct timeval t = ru.ru_utime;
      return (uint64_t) t.tv_sec*1000 + t.tv_usec/1000;
   }

public:

   Test ( fun_type function ): _function(function), _failed() {
      val_type entry;
      unsigned index = 0; // position in the whole bit array
      unsigned value = 0; // last value assigned to a bit
      static char const entropy[] = "The quick brown Fox jumps over the lazy Dog";
      do {
         if ( ! (index%entry_bits) ) entry = 0;
         entry <<= 1;
         entry |= value ^= 1u & entropy[index/7%sizeof(entropy)] >> index%7;
         ++index;
         if ( ! (index%entry_bits) ) _container.push_back(entry);
      } while ( index < total_bits );
   }

   bool correctness() {
      _failed = false;
      run(&Test::verify);
      return !_failed;
   }

   void performance() {
      uint64_t t1 = time_used();
      for ( unsigned i=0; i<999; i++ ) run(&Test::get_value);
      uint64_t t2 = time_used();
      printf("used %d ms\n",(unsigned)(t2-t1));
   }

   void operator() ( char const * name ) {
      printf("testing %s\n",name);
      correctness();
      performance();
   }

};

int main()
{
   typedef typename std::vector<uint64_t> Container;
   Test<Container> test(get<Container>); test("get");
   Test<Container> tost(got<Container>); tost("got");
}

The 3rd part, main.cpp above, contains a class of unit tests and applies them to get.h and got.h, that is, to my solution and to the original code of the question, slightly modified. The unit tests verify correctness and measure speed. They verify correctness by creating a container of 256 bits, filling it with some data, reading all possible bit sections of up to as many bits as fit into a container entry plus lots of pathological cases, and verifying the correctness of each of the results. They measure speed by again reading the same sections quite often and reporting the thread's time used in user space.

Hans Klünder
  • 2,176
  • 12
  • 8