
I work on GPL'ed C++ code that does heavy data processing. A common pattern is to collect thousands to millions of keys or key/value pairs (usually 32- to 128-bit integers), insert them into a hash set or hash map, and then use it without further modification.

I call this an immutable hash table, although single-assignment hash table may be an even better name, since we don't use it before it is fully constructed.

Today we use the STL unordered_map/unordered_set, but we are looking for a better (especially faster) library. Can you recommend anything suitable for this situation that has a GPL-compatible license?

I think the most efficient approach would be to radix-sort all keys by bucket number and provide a bucket->range mapping, so that we can search for a key with the following code:

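bool contains(const Set& set, Key key) {
  size_t b = hash(key) % BUCKETS;
  // the keys of bucket b occupy set.keys[set.bucket[b] .. set.bucket[b+1]-1]
  for (size_t i = set.bucket[b]; i < set.bucket[b+1]; ++i)
    if (set.keys[i] == key) return true;
  return false;
}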
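
The construction step for this layout could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: it uses a single counting-sort pass (the one-digit case of a radix sort) over 32-bit keys, and the names `BucketSet`, `hash_key` and `build` are hypothetical, introduced only for illustration. The mixing function is a placeholder that any well-distributed hash could replace.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: keys grouped by bucket, plus the start offset of each bucket.
struct BucketSet {
    std::vector<uint32_t> keys;       // keys ordered by bucket number
    std::vector<std::size_t> bucket;  // bucket b occupies keys[bucket[b] .. bucket[b+1])
};

// Placeholder hash (an assumption, not a recommendation).
inline std::size_t hash_key(uint32_t key) {
    uint32_t h = key;
    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h;
}

// Build the set once with a counting sort on the bucket number.
BucketSet build(const std::vector<uint32_t>& input, std::size_t buckets) {
    assert(buckets > 0);
    BucketSet s;
    s.bucket.assign(buckets + 1, 0);
    // Pass 1: count the keys of bucket b in slot b + 1.
    for (uint32_t k : input) ++s.bucket[hash_key(k) % buckets + 1];
    // Prefix sum: bucket[b] becomes the index of the first key of bucket b.
    for (std::size_t b = 0; b < buckets; ++b) s.bucket[b + 1] += s.bucket[b];
    // Pass 2: scatter every key into the next free position of its bucket.
    std::vector<std::size_t> next(s.bucket.begin(), s.bucket.end() - 1);
    s.keys.resize(input.size());
    for (uint32_t k : input) s.keys[next[hash_key(k) % buckets]++] = k;
    return s;
}

// Lookup: scan only the range that belongs to the key's bucket.
bool contains(const BucketSet& s, uint32_t key) {
    std::size_t b = hash_key(key) % (s.bucket.size() - 1);
    for (std::size_t i = s.bucket[b]; i < s.bucket[b + 1]; ++i)
        if (s.keys[i] == key) return true;
    return false;
}
```

Duplicate input keys are stored as-is in this sketch; deduplicating them, and choosing the number of buckets relative to the number of keys, are left open.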

Do you have any comments on this approach? Can you propose a faster way to implement an immutable map/set?

Bulat
  • There are faster implementations out there. Also, being immutable doesn’t necessarily improve retrieval performance. – user2864740 Feb 23 '20 at 18:25
  • I voted to close because the question is currently focused around finding a faster container implementation (“Can you recommend anything suitable for the situation..?”), which might be better served by searching.. finding “faster” involves performance benchmarks, over relevant data and usages, on a specific build/hardware, not guessing. – user2864740 Feb 23 '20 at 18:26
  • Faster mutable implementations are another question; I'm looking specifically for ones that are either immutable or especially efficient for my specific scenario. – Bulat Feb 23 '20 at 18:32
  • See if a [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) would help in your problem. – Igor Tandetnik Feb 23 '20 at 18:33
  • I would equally welcome pointers to benchmark results, or raw pointers to libraries. We will make our own benchmarks, but I need to find what to benchmark first. – Bulat Feb 23 '20 at 18:35
  • Thanks, I know about Bloom filters. – Bulat Feb 23 '20 at 18:35
  • It confuses the question to conflate “immutable” with “faster .. for my specific scenario” (which is *not well-defined*). There should be a clear primary focus. One can always trivially turn a mutable implementation into a write-once implementation, regardless of any different performance characteristics that an “immutable” type might be able to offer. – user2864740 Feb 23 '20 at 19:56
  • @user2864740 How can I improve it? My scenario deals with build-once hash tables, and my goal is to find the fastest implementation(s). I tried to give as much information as possible about my use cases to get more focused answers. I will run benchmarks on my own, but I'm looking for any suggestions, either existing libraries or algorithms and data structures to implement. – Bulat Feb 23 '20 at 20:42
  • As far as a *general* implementation, there is Robin Hashmap. Various benchmarks by the author can be found here - https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html The link is a bit dated, but it is still useful as a starting place for additional research. However, there might be a *specific implementation* for a given problem which might be more suitable given other (actual) information. – user2864740 Feb 23 '20 at 21:28

1 Answer


I think double hashing or Robin Hood hashing is more suitable for your case. Among the many possible algorithms, I prefer double hashing with a table of size 2^n and an odd step. This algorithm is very efficient and easy to code. The following is just an example of such a container for uint32_t keys:

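#include <cstdint>

class uint32_DH {
  static const uint32_t TABSZ = 1u << 20; // 1M cells; must be a power of two
  public:
  // Key 0 marks an empty cell, so 0 itself cannot be stored.
  // The table must never become full, or lookup() will not terminate.
  bool search(uint32_t key) { return *lookup(key) == key; }
  void insert(uint32_t key) { *lookup(key) = key; }
  private:
  uint32_t* lookup(uint32_t key) {
    // The original shifted by 32, which is undefined for a 32-bit key; 16 is used here instead.
    uint32_t pos  = key + (key >> 16) * 7919;
    // An odd step visits every cell of a 2^N table before repeating.
    uint32_t step = (key * 7717 ^ (pos >> 16)) | 1;
    uint32_t *rc;
    do {
      rc = _data + ((pos += step) & (TABSZ - 1));
    } while (*rc != 0 && *rc != key);
    return rc;
  }
  uint32_t _data[TABSZ] = {}; // 4 MB, zero-initialized; keep the object off the stack
};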
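
For the build-once scenario in the question, the same idea might be adapted so that the table is sized from the number of keys. The following is only a sketch under stated assumptions: it targets 64-bit keys, keeps the load factor at or below 0.5, tracks key 0 in a separate flag because 0 marks an empty cell, and uses the SplitMix64 finalizer as a stand-in hash. The class name `BuildOnceSet` and its members are hypothetical and not taken from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical build-once set: double hashing, power-of-two table, odd step.
class BuildOnceSet {
public:
    explicit BuildOnceSet(const std::vector<uint64_t>& keys) {
        std::size_t cap = 16;
        while (cap < 2 * keys.size()) cap <<= 1;  // load factor <= 0.5
        assert(cap > keys.size());                // at least one cell stays empty, so probing ends
        cells_.assign(cap, 0);
        mask_ = cap - 1;
        for (uint64_t k : keys) {
            if (k == 0) { has_zero_ = true; continue; }  // 0 is the empty marker
            cells_[slot(k)] = k;
        }
    }

    bool contains(uint64_t key) const {
        if (key == 0) return has_zero_;
        return cells_[slot(key)] == key;
    }

private:
    // SplitMix64 finalizer, used here as a stand-in mixing function.
    static uint64_t mix(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    // Index of the cell holding `key`, or of the first empty cell on its probe path.
    std::size_t slot(uint64_t key) const {
        uint64_t h = mix(key);
        std::size_t pos = static_cast<std::size_t>(h) & mask_;
        std::size_t step = static_cast<std::size_t>(h >> 32) | 1;  // odd step
        while (cells_[pos] != 0 && cells_[pos] != key) pos = (pos + step) & mask_;
        return pos;
    }

    std::vector<uint64_t> cells_;
    std::size_t mask_ = 0;
    bool has_zero_ = false;
};
```

Whether this beats unordered_set on real data depends on the key distribution and the hardware, so it would need to be benchmarked like any other candidate.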
olegarch