0

Java's BitSet is held entirely in memory and offers no compression.

Say I have 1 billion entries in a bit map: that occupies 125 MB in memory. If I have to do AND and OR operations on 10 such bit maps, that takes 1250 MB, or about 1.3 GB, of memory, which is unacceptable. How can I do fast operations on such bit maps without holding them uncompressed in memory?

I do not know the distribution of the bits in the bit set.

I have also looked at JavaEWAH, a variant of the Java BitSet class that uses run-length encoding (RLE) compression.

Is there any better solution?

Raedwald
Neel Salpe
    Why would you have to hold the 10 bit maps in memory if AND and OR only take 2 bit maps as arguments? – jean-loup Jul 15 '14 at 13:18
    Regarding your BitSet as a set of integers, how sparse is it? That is, how many of the billion integers in the BitSet range are present in the set? – Patricia Shanahan Jul 15 '14 at 15:20
  • Say they are cached and used in, say, 10 modules. @jean-loup – Neel Salpe Jul 15 '14 at 16:21
  • @PatriciaShanahan Say it is not sparse; say most entries form a series like 101010101. – Neel Salpe Jul 15 '14 at 16:22
    The Java BitSet data structure is very close to optimal for uniformly distributed bitsets. If you want to do better, you need to know something about how your bitsets are distributed. – tmyklebu Jul 15 '14 at 17:36
    In that situation, my next move would be measurement and analysis to find out a lot more about the data and its uses. – Patricia Shanahan Jul 16 '14 at 14:23
  • AND/OR are sequential operations, I don't see why you can't just store the bits in a file and then just stream over both at the same time and AND/OR the read bytes. – Thomas Jungblut Jul 23 '14 at 15:07
  • You may want to look into [roaring bitmap](http://roaringbitmap.org/). The companion paper contains various comparison and exploration of existing compressed bitmap implementations which is a useful starting point for further research. – Ze Blob Jul 27 '14 at 02:03
  • If you have a sparse bit set (which is usually the case for very large sets) then there are several techniques to deal with that efficiently, by compressing out the zero regions. The trick is to have a scheme which is efficient (and doesn't require decompression/recompression) when doing ANDs and ORs. – Hot Licks Jul 30 '14 at 12:33

3 Answers

2

One solution is to keep the arrays off the heap.

You'll want to read this answer by @PeterLawrey to a related question.

In summary, the performance of memory-mapped files in Java is quite good, and they avoid keeping huge collections of objects on the heap.

The operating system may limit the size of an individual memory-mapped region. It's easy to work around this limitation by mapping multiple regions. If the regions are a fixed size, simple binary operations on an entity's index can be used to find the corresponding region in the list of memory-mapped files.
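Mapping the bit data makes the word-wise AND/OR itself trivial: the buffers are read and written lazily through the page cache, so neither operand has to be loaded onto the heap. A minimal sketch, assuming both files are the same length and a multiple of 8 bytes (the file contents here are made up for illustration):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class MappedAnd {
    // AND two equally sized bit files word by word, writing the result
    // back into the first mapping. Absolute getLong/putLong keep the
    // position bookkeeping explicit.
    static void andInPlace(MappedByteBuffer a, MappedByteBuffer b) {
        for (int pos = 0; pos + Long.BYTES <= a.capacity(); pos += Long.BYTES) {
            a.putLong(pos, a.getLong(pos) & b.getLong(pos));
        }
    }

    public static void main(String[] args) throws Exception {
        Path fa = Files.createTempFile("bits-a", ".bin");
        Path fb = Files.createTempFile("bits-b", ".bin");
        byte[] ones = new byte[1 << 20];           // 1 MiB, all bits set
        Arrays.fill(ones, (byte) 0xFF);
        Files.write(fa, ones);
        byte[] alt = new byte[1 << 20];            // alternating bits
        Arrays.fill(alt, (byte) 0xAA);
        Files.write(fb, alt);

        try (FileChannel ca = FileChannel.open(fa, StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileChannel cb = FileChannel.open(fb, StandardOpenOption.READ)) {
            MappedByteBuffer ma = ca.map(FileChannel.MapMode.READ_WRITE, 0, ca.size());
            MappedByteBuffer mb = cb.map(FileChannel.MapMode.READ_ONLY, 0, cb.size());
            andInPlace(ma, mb);
            ma.force(); // flush the result back to disk
        }

        byte[] result = Files.readAllBytes(fa);
        System.out.println(result[0] == (byte) 0xAA
                && result[result.length - 1] == (byte) 0xAA);
        Files.delete(fa);
        Files.delete(fb);
    }
}
```

For a 125 MB bitmap this fits in a single mapping; beyond 2 GB per region you would chain several mappings as described above.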

Are you sure you need compression? Compression trades time for space. It's possible that the reduced I/O ends up saving you time, but it's also possible that it won't. Can you add an SSD?

If you haven't yet tried memory-mapped files, start with that. I'd take a close look at implementing something on top of Peter's Chronicle.

If you need more speed, you could try doing your binary operations in parallel.
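Since each word of the result depends only on the same word of the operands, the work splits cleanly across threads. A small sketch using a parallel stream over disjoint indices (array sizes chosen arbitrarily for the demo):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelAnd {
    // AND two word arrays in parallel; each index is handled by exactly
    // one worker, so no synchronization is needed.
    static void andInPlace(long[] a, long[] b) {
        IntStream.range(0, a.length).parallel().forEach(i -> a[i] &= b[i]);
    }

    public static void main(String[] args) {
        long[] a = new long[1 << 20];
        long[] b = new long[1 << 20];
        Arrays.fill(a, -1L);                 // all bits set
        Arrays.fill(b, 0xAAAAAAAAAAAAAAAAL); // alternating bits
        andInPlace(a, b);
        System.out.println(a[0] == 0xAAAAAAAAAAAAAAAAL
                && a[a.length - 1] == 0xAAAAAAAAAAAAAAAAL);
    }
}
```

The same pattern applies to memory-mapped regions by giving each thread its own slice of the mapping.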

If you end up needing compression, you could always implement it on top of Chronicle's memory-mapped arrays.

Ryan
  • This is exactly what I would suggest. You want to manipulate 1.3GB data but don't want to occupy that amount of memory then use Memory-Mapped files which will give you a smaller window into your data in file. Also use concurrent threads to do your operations in parallel using fork-join or parallel streams. – Gladwin Burboz Jul 29 '14 at 20:24
  • Memory Mapped files are _THE_ best way to efficiently answer the original question. I see that the wording of the question has since been edited to make compression a required part of any solution. If I knew how to down-vote an edit I would do it. – Ryan Aug 01 '14 at 18:10
0

From the comments, here is what I would add as a complement to your initial question:

  • the bit field distribution is unknown, so BitSet is probably the best structure we can use
  • you have to use the bit fields in different modules and want to cache them

That being said, my advice would be to implement a dedicated cache solution, using a LinkedHashMap with access order if LRU is an acceptable eviction strategy, and keeping permanent storage for the BitSets on disk.

Pseudo code:

class BitSetHolder {

    private final int size; // maximum number of BitSets kept in memory

    BitSetHolder(int size) {
        this.size = size;
    }

    class BitSetCache extends LinkedHashMap<Integer, BitSet> {
        BitSetCache() {
            super(16, 0.75f, true); // access order: least recently used first
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
            return size() > BitSetHolder.this.size; // size is known in BitSetHolder
        }
    }

    private final BitSetCache bitSetCache = new BitSetCache();

    BitSet get(int i) { // get from cache, falling back to disk
        BitSet bitSet = bitSetCache.get(i);
        if (bitSet == null) {
            // not in cache: load it from disk and cache it
            bitSet = readFromDisk(i); // persistent storage, to be implemented
            bitSetCache.put(i, bitSet);
        }
        return bitSet;
    }
}

That way:

  • you have transparent access to your 10 bit sets
  • you keep the most recently accessed bit sets in memory
  • you limit memory use to the size of the cache (the minimum size should be 3 if you want to create a bit set by combining 2 others)

If this is an option for your requirements, I could develop it a little more. Anyway, this is adaptable to other eviction strategies, LRU being the simplest as it comes natively with LinkedHashMap.

Serge Ballesta
0

The best solution depends a great deal on the usage patterns and structure of the data.

If your data has some structure beyond a raw bit blob, you might be able to do better with a different data structure. For example, a word list can be represented very efficiently in both space and lookup time using a DAG.

Sample Directed Graph and Topological Sort Code

BitSet is internally represented as a long[], which makes it slightly more difficult to refactor. If you grab the source from OpenJDK, you could rewrite it so that internally it uses iterators, backed by either files or in-memory compressed blobs. However, you would have to rewrite all the loops in BitSet to use iterators, so the entire blob never has to be instantiated.

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java
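A rough sketch of that iterator idea: the words of each bitmap are produced lazily, so an AND of two bitmaps is just a lazy iterator over the combined words, and nothing is ever fully instantiated. The word sources here are hard-coded LongStreams for illustration; in practice they would decode from a file or a compressed blob:

```java
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

public class StreamingAnd {
    // Combine two lazy word streams; each result word is computed on demand,
    // so neither operand needs to be materialized in memory.
    static PrimitiveIterator.OfLong and(PrimitiveIterator.OfLong a,
                                        PrimitiveIterator.OfLong b) {
        return new PrimitiveIterator.OfLong() {
            public boolean hasNext() { return a.hasNext() && b.hasNext(); }
            public long nextLong()   { return a.nextLong() & b.nextLong(); }
        };
    }

    public static void main(String[] args) {
        PrimitiveIterator.OfLong a = LongStream.of(-1L, 0L, 0xF0F0L).iterator();
        PrimitiveIterator.OfLong b = LongStream.of(0xFFL, -1L, 0x00FFL).iterator();
        PrimitiveIterator.OfLong r = and(a, b);
        StringBuilder sb = new StringBuilder();
        while (r.hasNext()) sb.append(Long.toHexString(r.nextLong())).append(' ');
        System.out.println(sb.toString().trim());
    }
}
```

OR and XOR follow the same shape, and the iterators compose, so a chain of 10 bitmaps still streams one word at a time.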
