Choosing between set<size_t> vs. vector<bool> vs. vector<char> to use as a bitmap (bitset / bit array)

Question

Given a range of indexes (identifiers), where I want to map each index to a boolean value, that is:

// interface pseudocode
interface bitmap {
  bool identifier_is_set(unsigned int id_idx) const;
  void set_identifier(unsigned int id_idx, bool val);
};

so that I can set and query for each ID (index) if it is set or not, what would you prefer to use to implement this?

I think this is called a bit array or bitmap or bitset, correct me if I'm wrong.

Assume that the maximum identifier is predetermined and not greater than 1e6 (1m), possibly much smaller (10k - 100k). (Which means the size used by sizeof(int)*maximum_id_idx easily fits into memory.)

Possible solutions I see so far:

std::set<size_t> - Insert the identifier into this set or erase it as necessary. This would allow for arbitrarily large identifiers as long as we have a sparse bitmap.
std::vector<bool> - Sized to the appropriate maximum value, storing true or false for each id_idx.
std::vector<char> - Same thing, but not suffering from weird std::vector<bool> problems. Uses less memory than vector<int>.
std::vector<int> - Using an int as the boolean flag to have a container using the natural word size of the machine. (No clue if that could make a difference.)
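For illustration, here is a minimal sketch of how the std::vector<char> option could sit behind the interface above. The class name ByteBitmap and its constructor are hypothetical and only serve the example; the other vector-based options would look the same apart from the element type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper class; name and constructor are illustrative only.
class ByteBitmap {
public:
    // max_id is the largest identifier that will ever be used.
    explicit ByteBitmap(std::size_t max_id) : flags_(max_id + 1, 0) {}

    bool identifier_is_set(unsigned int id_idx) const {
        assert(id_idx < flags_.size());
        return flags_[id_idx] != 0;
    }

    void set_identifier(unsigned int id_idx, bool val) {
        assert(id_idx < flags_.size());
        flags_[id_idx] = val ? 1 : 0;
    }

private:
    std::vector<char> flags_;  // one byte per identifier
};
```

A query is then a single indexed read with no bit manipulation, at the cost of eight times the memory of a packed representation.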

Please answer which container type you would prefer and why, given the maximum id restriction cited above and especially considering performance aspects of querying the bitmap (inserting performance does not matter).

Note: The interface usage of vector vs. set does not matter, as it will be hidden behind its wrapping class anyway.

EDIT: To add to the discussion about std::bitset: std::bitset incorporates the whole array into the object itself, so sizeof(std::bitset<1000000>) is approximately 125,000 bytes (1/8 of a megabyte). That makes for a huge single object and for something you cannot put on the stack anymore (which may or may not be relevant).
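A small sketch of the size concern, assuming a compile-time upper bound of one million identifiers. The names kMaxIds, IdBits and make_id_bits are made up for the example. It checks that the bits are stored inside the object and shows one way to keep the object off the stack by allocating it on the heap.

```cpp
#include <bitset>
#include <cstddef>
#include <memory>

constexpr std::size_t kMaxIds = 1000000;  // assumed compile-time upper bound

using IdBits = std::bitset<kMaxIds>;

// The storage lives inside the object, so sizeof covers all the bits.
static_assert(sizeof(IdBits) >= kMaxIds / 8, "bitset stores its bits inline");

// Allocating on the heap keeps the large object off the stack.
std::unique_ptr<IdBits> make_id_bits() {
    return std::unique_ptr<IdBits>(new IdBits());  // all bits start cleared
}
```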

Whenever performance matters and you can't make an "obvious" choice initially, you must run tests comparing equivalent behavior. — , Nov 11 '10 at 16:37

score 3 · Accepted Answer · answered Nov 11 '10 at 21:19

Without knowing the platform you are running this code on and your access patterns, it's hard to say whether vector<bool> will be faster than vector<char> (or vector<int>) or even set<int> or unordered_set<int>.

For example, if you have an extremely sparse array, a linear search of a vector<int> that just contains the indices set might be the best answer. (See Mike Abrash's article on optimizing Pixomatic for x86.)
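As a rough sketch of that idea (the helper names sparse_is_set and sparse_set are hypothetical), the vector stores only the identifiers that are currently set, and a query is a linear scan:

```cpp
#include <algorithm>
#include <vector>

// set_ids holds only the identifiers that are currently set.
bool sparse_is_set(const std::vector<int>& set_ids, int id) {
    return std::find(set_ids.begin(), set_ids.end(), id) != set_ids.end();
}

void sparse_set(std::vector<int>& set_ids, int id, bool val) {
    auto it = std::find(set_ids.begin(), set_ids.end(), id);
    if (val) {
        if (it == set_ids.end()) set_ids.push_back(id);
    } else if (it != set_ids.end()) {
        *it = set_ids.back();  // order is irrelevant, so swap and pop
        set_ids.pop_back();
    }
}
```

This only has a chance of winning while the list of set identifiers stays small enough to remain in cache; measuring on the target platform is the only way to find the break-even point.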

On the other hand, you might have a somewhat sparse array. By somewhat sparse, I mean that the set elements take up much more memory than fits in your L1 or L2 cache. In that case, more low-level details start to come into play, as well as your actual access patterns.

For example, on some platforms, variable bit shifting is incredibly expensive. So, if you are querying a random set of identifiers, the more frequently you do this, the more a vector<char> or vector<int> becomes a better idea than bitset<...> or vector<bool>. (The latter two use bit shifts to look up bits.) On the other hand, if you are iterating through the sparse bit vector in order and just want the bits set, you can optimize that iteration to get rid of the overhead of variable shifts.
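To make the difference concrete, here is a sketch over a hand-rolled packed array of 32-bit words. This layout is an assumption for illustration, not how any particular standard library implements bitset or vector<bool>, and both function names are made up. The random lookup needs a shift by a variable amount, while the in-order scan only ever shifts its mask by one.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Random access into a packed bit array: one variable shift per query.
bool test_bit(const std::vector<std::uint32_t>& words, std::size_t id) {
    return ((words[id / 32] >> (id % 32)) & 1u) != 0;
}

// In-order scan: the mask moves by a constant shift of one, and
// all-zero words are skipped without looking at individual bits.
std::vector<std::size_t> collect_set_bits(const std::vector<std::uint32_t>& words) {
    std::vector<std::size_t> ids;
    for (std::size_t w = 0; w < words.size(); ++w) {
        if (words[w] == 0) continue;
        std::size_t id = w * 32;
        for (std::uint32_t mask = 1u; mask != 0; mask <<= 1, ++id) {
            if (words[w] & mask) ids.push_back(id);
        }
    }
    return ids;
}
```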

At this point, you might also want to know how your sparse identifiers are actually distributed. If they are clumped, you need to know the tradeoff between the optimal memory read size and reading a char at a time. That will dictate whether hitting the cache more often will offset reading in non-native sized data.

If the identifiers are scattered, you may get a significant win by using a hash set (unordered_set<int>) instead of a bit vector. That depends on the load, however.
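A minimal sketch of the hash set variant, assuming a C++11 standard library. The wrapper name HashedIdSet and the expected_count parameter are hypothetical, and the element type is unsigned int here only to match the interface in the question.

```cpp
#include <cstddef>
#include <unordered_set>

// Hypothetical wrapper; only identifiers that are set are stored.
class HashedIdSet {
public:
    explicit HashedIdSet(std::size_t expected_count) {
        ids_.reserve(expected_count);  // pre-size the buckets to limit rehashing
    }

    bool identifier_is_set(unsigned int id_idx) const {
        return ids_.find(id_idx) != ids_.end();
    }

    void set_identifier(unsigned int id_idx, bool val) {
        if (val) ids_.insert(id_idx);
        else ids_.erase(id_idx);
    }

private:
    std::unordered_set<unsigned int> ids_;
};
```

Memory use grows with the number of set identifiers rather than with the maximum identifier, which is where the dependence on load comes from.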

score 2 · Answer 2 · answered Nov 11 '10 at 15:16

Have you checked out boost::dynamic_bitset?

http://www.boost.org/doc/libs/1_36_0/libs/dynamic_bitset/dynamic_bitset.html

answered Nov 11 '10 at 15:16

Moo-Juice


Thanks. I do not need any of the set operations dynamic_bitset defines, so it seems this class would just add overhead where none is needed. – Martin Ba Nov 11 '10 at 15:21
dynamic_bitset is just like std::bitset but can be resized. The extra functionality doesn't cost if you don't use it. – CashCow Nov 11 '10 at 15:22
@Martin: I concur with @CashCow. This is an advantage of templates; the code for those functions won't be generated if you don't instantiate them. – Steve M Nov 11 '10 at 15:34
Using this does not buy you anything over regular `bitset` when maximum size is predetermined. – Steve Townsend Nov 11 '10 at 15:49

score 2 · Answer 3 · answered Nov 11 '10 at 15:22

Assume that the maximum identifier is predetermined and not greater than 1e6 (1m)

Use a std::bitset if you have a hard limit:

std::bitset<1000000> bits;
bits.set(1000);

answered Nov 11 '10 at 15:22

Steve M


score 1 · Answer 4 · edited Nov 11 '10 at 16:30

If by performance you mean quickest to look up, then std::bitset is probably fast enough, as its lookup is constant-time. There is an initial overhead to set all the bits to zero. vector<int> would probably be unnoticeably faster to query but would have a bigger overhead to initialize, as it holds 32 times as many bits on a 32-bit system.

vector<bool> is like bitset in its implementation, and has the advantage of being resizable if you need that, although in general I would avoid vector<bool> and use Boost's dynamic_bitset if I needed to resize.

std::set would be O(log N) in lookup and insertion/deletion although it is the most scalable in memory use, occupying less if the set is not particularly full. std::set is not restricted in range.

Some form of hash is also an option if your data is more sparse: setting and lookup are generally O(1), although there may be some overhead from collision handling.

answered Nov 11 '10 at 15:21

CashCow


std::bitset is not appropriate as the maximum size can vary between approx. 10k and 1m at runtime – Martin Ba Nov 11 '10 at 15:23
@Martin: If you know what the maximum size is, then bitset is still going to work and it's going to be faster than any of the dynamic options. – Steve M Nov 11 '10 at 15:26
Steve: Please see http://stackoverflow.com/q/4156538/321013 as I do not get how a `std::bitset` with arbitrarily large size can be faster in lookup than a `std::vector` – Martin Ba Nov 11 '10 at 16:17
For resizable bitsets you can use boost::dynamic_bitset (mentioned near the end of the 2nd paragraph). – CashCow Nov 11 '10 at 16:23

score 0 · Answer 5 · answered Nov 11 '10 at 15:19

The fastest seems to be using a bitmask. You should construct a std::vector<int> and make its size adequate (N divided by sizeof(int)*8, rounded up).

This seems to be faster than std::vector<bool> (or similar) for large sets of data, because you actually use much less memory and hence cache utilization is better.
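A sketch of such a bitmask follows. The class name WordBitmask and the constant kBitsPerWord are invented for the example, and it uses unsigned int rather than int words so that shifting into the top bit is well defined.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

constexpr std::size_t kBitsPerWord = sizeof(unsigned int) * CHAR_BIT;

class WordBitmask {
public:
    // n is the number of identifiers; the word count is rounded up.
    explicit WordBitmask(std::size_t n)
        : size_(n), words_((n + kBitsPerWord - 1) / kBitsPerWord, 0u) {}

    bool test(std::size_t id) const {
        assert(id < size_);
        return ((words_[id / kBitsPerWord] >> (id % kBitsPerWord)) & 1u) != 0;
    }

    void set(std::size_t id, bool val) {
        assert(id < size_);
        const unsigned int mask = 1u << (id % kBitsPerWord);
        if (val) words_[id / kBitsPerWord] |= mask;
        else     words_[id / kBitsPerWord] &= ~mask;
    }

    std::size_t word_count() const { return words_.size(); }

private:
    std::size_t size_;
    std::vector<unsigned int> words_;
};
```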

answered Nov 11 '10 at 15:19

valdo


How would `std::vector<int>` used as a bitmask be *any* different from a `std::vector<bool>` that's already supposed to do that? – Martin Ba Nov 11 '10 at 15:25

score 0 · Answer 6 · answered Nov 11 '10 at 16:58

You could always have a std::vector<std::bitset<sizeof(size_t)> >. Your lookup is then a simple calculation (though the modulo operation is relatively slow), and you have the advantage of this being able to grow. I would hazard that, space-wise, the above is probably the most optimal as well.
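A sketch of that layout. Note the assumptions: the answer writes sizeof(size_t) as the bitset width, whereas this example uses a fixed chunk width of 64 bits, and the names GrowableBitmap and kChunkBits are hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t kChunkBits = 64;  // assumed chunk width in bits

class GrowableBitmap {
public:
    bool test(std::size_t id) const {
        const std::size_t chunk = id / kChunkBits;
        // Identifiers beyond the current storage count as not set.
        return chunk < chunks_.size() && chunks_[chunk].test(id % kChunkBits);
    }

    void set(std::size_t id, bool val) {
        const std::size_t chunk = id / kChunkBits;
        if (chunk >= chunks_.size()) {
            if (!val) return;           // clearing an unset bit needs no storage
            chunks_.resize(chunk + 1);  // new chunks start with all bits cleared
        }
        chunks_[chunk].set(id % kChunkBits, val);
    }

    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::vector<std::bitset<kChunkBits>> chunks_;
};
```

With a power-of-two chunk width, compilers can typically turn the division and modulo into a shift and a mask, which may reduce the cost mentioned above.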
