I have a bitset that I use to track whether an item is present or not. For example:

b = 01100110000

This means the 2nd and 3rd items are present and the 1st and 4th items are not.

While searching for a library that can optimise this bitset, I came across Roaring bitmaps, which sounded very exciting.

I did a quick test with it:

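    import java.util.BitSet;
    import java.util.Random;

    import org.roaringbitmap.RoaringBitmap;

    public class RoaringSizeTest {
        public static void main(String[] args) {
            RoaringBitmap roaringBitMap = new RoaringBitmap();
            BitSet bitSet = new BitSet(5000);
            double prob = 0.001; // probability that any given bit is set
            Random random = new Random();
            for (int i = 0; i < 5000; i++) {
                if (random.nextDouble() < prob) {
                    bitSet.set(i);
                    roaringBitMap.add(i);
                }
            }
            System.out.println(bitSet.cardinality()); // number of bits set
            // BitSet.size() reports bits, and getSizeInBytes() is multiplied by 8,
            // so both figures below are in bits.
            System.out.println("BitSet bits: " + bitSet.size());
            System.out.println("RoaringBitmap bits: " + roaringBitMap.getSizeInBytes() * 8);
        }
    }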

Basically, we set some bits and then check the overall size of each data structure.

Running this with several values of `prob`, I got:

| prob | BitSet size (bits) | RoaringBitmap size (bits) |
|-------|------|-------|
| 0.001 | 5056 | 288 |
| 0.01 | 5056 | 944 |
| 0.1 | 5056 | 7872 |
| 0.999 | 5056 | 65616 |

As you can see, the memory footprint of RoaringBitmap grows as we insert more and more numbers.

  1. Is this expected?
  2. In the worst case, should it not just fall back to a bitset-based implementation?
  3. Can't 0.999 be treated as the inverse of 0.001, so that it could be stored in 288 bits as well?
  4. What is the best way to represent these bitsets as a String when making inter-service calls with the Jackson library (and not a byte-based serialisation library)?
best wishes
  • The [api docs](https://www.javadoc.io/static/org.roaringbitmap/RoaringBitmap/0.9.30/org/roaringbitmap/RoaringBitmap.html#getLongSizeInBytes()) actually describe the memory footprint – g00se Jun 28 '22 at 14:24
  • I did read that, but if you think about it, the worst case could be limited to a bitset plus some metadata overhead. My question is why it goes so far above the bitset. – best wishes Jun 28 '22 at 14:35
  • Not sure what `add` is really doing. It *could* be doing something like a call to `StringBuilder.append`, whereby storage allocation is jumping by a factor other than one. There seems to be no `RoaringBitmap` which creates a bitmap for a finite number of bytes. As for the `String` thing, fyi the visualization of every bit of the `BitSet` gzips to 69 bytes for me – g00se Jun 28 '22 at 15:11

3 Answers

This seems to be the case when the number of entries is small. As the number of entries increases, the difference becomes less visible. This has not been confirmed by the library author, though (I asked here and followed up here).

| prob | number of entries | bitset bits | RoaringBitmap bits | saving % |
|-------|----------|----------|----------|-------|
| 0.001 | 50000 | 50048 | 928 | 98 |
| 0.01 | 50000 | 50048 | 7744 | 84 |
| 0.1 | 50000 | 50048 | 65616 | -31 |
| 0.999 | 50000 | 50048 | 65616 (NOTE: it does not increase) | -31 |
| 0.001 | 500000 | 500032 | 8704 | 98 |
| 0.01 | 500000 | 500032 | 80720 | 83 |
| 0.1 | 500000 | 500032 | 524480 | -4 |
| 0.999 | 500000 | 500032 | 524480 (NOTE: it does not increase) | -4 |
| 0.001 | 50000000 | 50000000 | 835232 | 98 |
| 0.01 | 50000000 | 50000000 | 8036368 | 83 |
| 0.1 | 50000000 | 50000000 | 50016240 | -0.03 |
| 0.999 | 50000000 | 50000000 | 50016240 (NOTE: it does not increase) | -0.03 |

Looking at this, it seems that as the number of entries grows, the library may simply be using a bitmap behind the scenes. The takeaway is: don't use the library blindly; test it for your use case.
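The "saving %" column appears to be the share of the bitset's bits that RoaringBitmap avoids, truncated to a whole number. That definition is an assumption inferred from the figures, and the class and method names below are made up for illustration. A minimal sketch:

```java
public class SavingPercent {
    // Assumed definition: percentage of the bitset's bits saved by RoaringBitmap.
    // A negative result means RoaringBitmap is larger than the bitset.
    static double savingPercent(long bitsetBits, long roaringBits) {
        return 100.0 * (bitsetBits - roaringBits) / bitsetBits;
    }

    public static void main(String[] args) {
        // Figures taken from the table; the (int) cast truncates toward zero.
        System.out.println((int) savingPercent(50048, 928));     // sparse case
        System.out.println((int) savingPercent(50048, 65616));   // dense case
        System.out.println((int) savingPercent(500032, 524480)); // dense, larger range
    }
}
```

Under this reading, the negative values shrink toward zero as the range grows because the extra RoaringBitmap overhead stays small relative to the bitmap itself.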

best wishes
  • This is expected. Once the data is dense enough, the roaring bitmap falls back to a bitmap, so overall memory usage is only marginally larger than a plain bitmap. – user179156 Sep 20 '22 at 09:15

The Roaring Bitmap format has a public specification:

https://github.com/RoaringBitmap/RoaringFormatSpec

Memory usage is only one factor in an application's performance. Roaring bitmaps seek to provide economical storage while also providing high performance in real-world applications.

Given N integers in [0,x), the serialized size in bytes of a Roaring bitmap should never exceed this bound:

8 + 9 * ((long)x+65535)/65536 + 2 * N

That is, given a fixed overhead for the universe size (x), Roaring bitmaps never use more than 2 bytes per integer.
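To make the bound concrete, here is a small sketch that evaluates the expression exactly as written above. The class and method names and the sample inputs are illustrative assumptions, chosen to resemble the question's universe of 5000 values.

```java
public class RoaringSizeBound {
    // Upper bound, in bytes, on the serialized size of a Roaring bitmap
    // holding n integers from [0, x). The expression is copied from the
    // formula above; Java evaluates 9 * (x + 65535) / 65536 left to right
    // using integer division.
    static long maxSerializedBytes(long x, long n) {
        return 8 + 9 * (x + 65535) / 65536 + 2 * n;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: a universe of 5000, nearly full and nearly empty.
        System.out.println(maxSerializedBytes(5000, 4995)); // 10007
        System.out.println(maxSerializedBytes(5000, 5));    // 27
    }
}
```

For comparison, the 65616 bits reported in the question is 8202 bytes, which is below the first figure, although that number comes from `getSizeInBytes()` rather than from a serialized size.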

There is no such thing as a data structure that is always ideal. You should make sure that Roaring bitmaps fit your application profile. There are at least two cases where Roaring bitmaps can be easily replaced by superior alternatives compression-wise:

  1. You have few random values spanning a large interval (i.e., you have a very sparse set). For example, take the set 0, 65536, 131072, 196608, 262144 ... If this is typical of your application, you might consider using a hash set or a simple sorted array.

  2. You have a dense set of random values that never form runs of continuous values. For example, consider the set 0,2,4,...,10000. If this is typical of your application, you might be better served with a conventional bitset.

Daniel Lemire

RoaringBitmap has three types of containers, and in our scenarios we mostly end up with BitmapContainer. Each BitmapContainer covers 65536 values and occupies 8 KB of memory. Storage is split by the high and low 16 bits of each value: the high 16 bits select the bucket, and if that bucket does not exist yet (the value lies in a different 65536-wide range from the existing ones), a new bucket is created. With overly sparse data this can produce many buckets that are far from full, which wastes memory and can make RoaringBitmap use more memory than a regular bitmap in some scenarios. If the sparse values fall within one 65536-wide range, or the number of elements is small, it uses less memory than a bitmap. Its biggest advantage is that it does not need to allocate space according to the maximum offset.
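The high/low split described above can be sketched with plain bit operations. The class and method names here are made up for illustration and are not the library's internals.

```java
public class RoaringKeySplit {
    // High 16 bits: select the bucket (container) a value belongs to.
    static int containerKey(int value) {
        return value >>> 16;
    }

    // Low 16 bits: the position inside that bucket, in the range 0..65535.
    static int lowBits(int value) {
        return value & 0xFFFF;
    }

    public static void main(String[] args) {
        int[] samples = {0, 65535, 65536, 131072, 200000};
        for (int v : samples) {
            System.out.println(v + " -> bucket " + containerKey(v) + ", low " + lowBits(v));
        }
    }
}
```

Here 0 and 65535 share bucket 0, while 65536 and 131072 each start a new bucket, which is why values spaced 65536 apart each pay for a bucket of their own.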

Halil Ozel