Fastest way of counting occurrences of 1s in multiple std::bitset?

Question

I want to count the occurrences of 1 at each bit position across multiple bitsets. The count for each position is stored in a vector.

E.g.

b0 = 1011
b1 = 1110
b2 = 0110
     ----
 c = 2231 (1+1+0,0+1+1,1+1+1,1+0+0)

I can do that with the code below, but I suspect it performs poorly, though I'm not sure. So my question is simply: is there a faster way to count the 1s?

#include <bitset>
#include <vector>
#include <iostream>
#include <string>

int main(int argc, char ** argv)
{
  std::vector<std::bitset<4>> bitsets;
  bitsets.push_back(std::bitset<4>("1011"));
  bitsets.push_back(std::bitset<4>("1110"));
  bitsets.push_back(std::bitset<4>("0110"));

  std::vector<unsigned> counts;

  for (int i=0,j=4; i<j; ++i)
  {
    counts.push_back(0);
    for (int p=0,q=bitsets.size(); p<q; ++p)
    {
      if (bitsets[p][(4-1)-i]) // reverse order
      {
        counts[i] += 1;
      }
    }
  }

  for (auto const & count: counts)
  {
      std::cout << count << " ";
  }
}

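// b: the bitsets, c: one counter per bit position (bit 0 first)
for (int i=0,j=4; i<j; ++i)
{
  for (int p=0,q=b.size(); p<q; ++p)
  {
    if(b[p][i])
    {
      c[i] += 1; // index by bit position, not by bitset
    }
  }
}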

I forgot to include this online compiler [link](https://ideone.com/c3pwLI) — user1587451, Jul 04 '16 at 11:48
Some things: 1. Why use a `std::vector` instead of a `std::array`, since all your bitsets have a compile-time constant size? And if you need `std::vector`, initialize it with the correct size instead of using `push_back`. 2. You can probably get a bit faster by swapping the two loops (make the inner loop the outer loop). This would avoid having to load the same piece of memory multiple times. — Holt, Jul 04 '16 at 11:58
Do you really need `std::bitset<>`? If not, why not use a simple char, put the binary value in it and do some simple low-level bit operations? — ckruczek, Jul 04 '16 at 11:59
The `std::bitset`s are stored in a vector because I don't know how many there are at compile time. Furthermore, `std::bitset` is used because logical operations are performed on them. — user1587451, Jul 04 '16 at 12:01

Jeremy · Answer 1 · 2016-07-04T13:34:13.213

A table-driven approach. It obviously has its limits*, but depending on the application could prove quite suitable:

#include <array>
#include <bitset>
#include <string>
#include <iostream>
#include <cstdint>

static const uint32_t expand[] = {
        0x00000000,
        0x00000001,
        0x00000100,
        0x00000101,
        0x00010000,
        0x00010001,
        0x00010100,
        0x00010101,
        0x01000000,
        0x01000001,
        0x01000100,
        0x01000101,
        0x01010000,
        0x01010001,
        0x01010100,
        0x01010101
};

int main(int argc, char* argv[])
{
        std::array<std::bitset<4>, 3> bits = {
            std::bitset<4>("1011"),
            std::bitset<4>("1110"),
            std::bitset<4>("0110")
        };

        uint32_t totals = 0;

        for (auto& x : bits)
        {
                totals += expand[x.to_ulong()];
        }

        std::cout << ((totals >> 24) & 0xff) << ((totals >> 16) & 0xff) << ((totals >> 8) & 0xff) << ((totals >> 0) & 0xff) << std::endl;
        return 0;
}

Edit: * Actually, it's less limited than one might think...

score 0 · Answer 2 · answered Jul 04 '16 at 12:02

I would personally transpose the way you order your bits.

1011              110
1110    becomes   011
0110              111
                  100

There are two main reasons: you can use STL algorithms, and you get better data locality, which helps performance when you work on bigger sizes.

#include <bitset>
#include <vector>
#include <iostream>
#include <string>
#include <iterator>

int main()
{
    std::vector<std::bitset<3>> bitsets_transpose;  
    bitsets_transpose.reserve(4);
    bitsets_transpose.emplace_back(std::bitset<3>("110"));
    bitsets_transpose.emplace_back(std::bitset<3>("011"));
    bitsets_transpose.emplace_back(std::bitset<3>("111"));
    bitsets_transpose.emplace_back(std::bitset<3>("100"));

    std::vector<size_t> counts;
    counts.reserve(4);
    for (auto &el : bitsets_transpose) {
        counts.emplace_back(el.count()); // use bitset::count()
    }

    // print counts result
    std::copy(counts.begin(), counts.end(), std::ostream_iterator<size_t>(std::cout, " "));
}

Output is

2 2 3 1

If you're going to hand-preprocess the data, why not go the whole hog? e.g. ` std::cout << "2 2 3 1\n"; ` — Jeremy, Jul 04 '16 at 12:08
@Jeremy I don't get your comment. Transposing a matrix before computation is a well-known method. So if he can change the way his data is ordered at input, that is good for him. If not, this answer can still be an interesting method to use, and the OP should implement a transpose method and then measure performance. — coincoin, Jul 04 '16 at 12:09
Well, I was being somewhat facetious, and I agree that transposing the data is a good approach to consider. But your example skips the transposition stage, which _could_ turn out to be just as expensive as tallying up the data in situ. Of course, it largely depends on how much flexibility you have in specifying the form of the input data. — Jeremy, Jul 04 '16 at 12:41

score 0 · Answer 3 · answered Jul 04 '16 at 12:05

Refactoring to separate counting logic from vector management allows us to inspect the efficiency of the counting algorithm:

#include <bitset>
#include <vector>
#include <iostream>
#include <string>
#include <iterator>

__attribute__((noinline))
void count(std::vector<unsigned>& counts, // by reference, so the caller sees the result
           const std::vector<std::bitset<4>>& bitsets)
{
  for (int i=0,j=4; i<j; ++i)
  {
    for (int p=0,q=bitsets.size(); p<q; ++p)
    {
      if (bitsets[p][(4-1)-i]) // reverse order
      {
        counts[i] += 1;
      }
    }
  }
}

int main(int argc, char ** argv)
{
  std::vector<std::bitset<4>> bitsets;
  bitsets.push_back(std::bitset<4>("1011"));
  bitsets.push_back(std::bitset<4>("1110"));
  bitsets.push_back(std::bitset<4>("0110"));

  std::vector<unsigned> counts(4, 0); // one counter per bit position, not per bitset

  count(counts, bitsets);

  for (auto const & count: counts)
  {
      std::cout << count << " ";
  }
}

gcc5.3 with -O2 yields this:

count(std::vector<unsigned int, std::allocator<unsigned int> >, std::vector<std::bitset<4ul>, std::allocator<std::bitset<4ul> > > const&):
        movq    (%rsi), %r8
        xorl    %r9d, %r9d
        movl    $3, %r10d
        movl    $1, %r11d
        movq    8(%rsi), %rcx
        subq    %r8, %rcx
        shrq    $3, %rcx
.L4:
        shlx    %r10, %r11, %rsi
        xorl    %eax, %eax
        testl   %ecx, %ecx
        jle     .L6
.L10:
        testq   %rsi, (%r8,%rax,8)
        je      .L5
        movq    %r9, %rdx
        addq    (%rdi), %rdx
        addl    $1, (%rdx)
.L5:
        addq    $1, %rax
        cmpl    %eax, %ecx
        jg      .L10
.L6:
        addq    $4, %r9
        subl    $1, %r10d
        cmpq    $16, %r9
        jne     .L4
        ret

Which does not seem at all inefficient to me.

score 0 · Answer 4 · answered Jul 04 '16 at 12:13

Your program performs redundant memory reallocations and contains some unnecessary code. For example, you could reserve enough memory in the vector before calling push_back.

The program could look like this.

#include <iostream>
#include <bitset>
#include <vector>

const size_t N = 4;

int main() 
{
    std::vector<std::bitset<N>> bitsets = 
    { 
        std::bitset<N>( "1011" ), 
        std::bitset<N>( "1110" ),
        std::bitset<N>( "0110" )
    };

    std::vector<unsigned int> counts( N );

    for ( const auto &b : bitsets )
    {
        for ( size_t i = 0; i < N; i++ ) counts[i] += b[N - i -1]; 
    }

    for ( unsigned int val : counts ) std::cout << val;
    std::cout << std::endl;

    return 0;
}

Its output is

2231