
It seems like I should be able to implement a vector-type object that I can insert into and read from simultaneously like so:

  1. If there is space in the vector I can just insert things; this shouldn't interfere with reading.
  2. If I have to reallocate, I can allocate, then copy, then update the pointer to the data, then release.
  3. If I want to read from the vector, I just have to ensure that fetching the pointer to the data and reading from it are done atomically.

In this way, if I read from the vector while it's reallocating, I just read from the old location, which is still valid. (Of course deletion won't be thread-safe, but that's fine; you'll just have to take that into account if you want to delete things, which is at any rate no worse than what you have with std::vector right now.)
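
For concreteness, here's roughly what I have in mind for steps 1-3 (a sketch with made-up names; it assumes a single writer thread, requires T to be default-constructible and copyable, and deliberately leaks the old buffer on reallocation, since I don't yet see how to free it safely while a reader might still hold a pointer into it):

#include <atomic>
#include <cstddef>

// Sketch only: single writer, non-blocking readers, old buffers leak.
template <typename T>
class append_only_vector {
  std::atomic<T*> data_{nullptr};
  std::atomic<std::size_t> size_{0};
  std::size_t capacity_ = 0;  // touched by the single writer only

public:
  // Steps 1 and 2: insert in place if there is room; otherwise
  // allocate, copy, and swing the pointer. The old buffer is never
  // freed, because a concurrent reader may still be using it.
  void push_back(const T& value) {
    std::size_t n = size_.load(std::memory_order_relaxed);
    T* buf = data_.load(std::memory_order_relaxed);
    if (n == capacity_) {
      capacity_ = capacity_ ? 2 * capacity_ : 4;
      T* fresh = new T[capacity_];
      for (std::size_t i = 0; i < n; ++i) fresh[i] = buf[i];
      data_.store(fresh, std::memory_order_release);  // publish the copy
      buf = fresh;                                    // old buffer leaks
    }
    buf[n] = value;
    size_.store(n + 1, std::memory_order_release);  // element now visible
  }

  // Step 3: grab the pointer once and read through it; during a
  // reallocation a reader may see the old buffer, which is still valid.
  bool read(std::size_t i, T& out) const {
    if (i >= size_.load(std::memory_order_acquire)) return false;
    out = data_.load(std::memory_order_acquire)[i];
    return true;
  }
};

Even in this sketch the buffer and size have to be published with release stores and read with acquire loads, so that a reader never observes an element before it's fully written.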

In turn, I should be able to adapt this into a hashtable without too much trouble -- use one of these vectors to hold the buckets, and back each bucket with another one of these vectors. (I realize you should back the buckets with some kind of self-balancing binary tree to get optimal asymptotic complexity, but vectors are fine for my application and I don't want to take things too far off course here.)

Two questions:

  1. Does this make sense or am I missing something? (I don't trust my intuition for thread safety.)
  2. If so, is it possible to build this or something like it using some of the containers in the C++ standard library as primitives, or would my only play be to write the whole thing from scratch? (I imagine I'd use std::atomic in a few places, of course, but is there some way to use things like std::vector or std::unordered_map here?)

Alternatively, is there a book or something I can read on the subject?

Daniel McLaury
  • How can you guarantee that no other threads are still accessing the data you delete? – Jeremy Friesner May 11 '20 at 17:29
  • You don't need to write everything from scratch. There are many existing solutions for concurrent hash tables, some of them are very efficient. But usually at some price. For instance we use [this custom implementation](https://gitlab.com/tdytrych/lsu3shell/-/blob/master/libraries/LookUpContainers/HashFixed.h) in our application, which is highly efficient, but you need to specify the maximum number of elements in advance and there is no possibility to remove elements. Some implementations support only some kinds of keys (mostly to work with them in an atomic way) etc. – Daniel Langr May 11 '20 at 17:33
  • @DanielLangr: That made me sign up, and then gave me a 404 after I created an account. – Daniel McLaury May 11 '20 at 17:34
  • @DanielMcLaury Sorry, look [here](https://sourceforge.net/p/lsu3shell/code/ci/master/tree/libraries/LookUpContainers/HashFixed.h). Facebook Folly has `ConcurrentHashMap` and `AtomicHashMap`, Intel TBB has `concurrent_hash_map`, just to name some. – Daniel Langr May 11 '20 at 17:36
  • @JeremyFriesner: When I'm trying to read during a reallocate? That's why I said that grabbing and dereferencing the pointer to the data would have to be done as a single atomic operation. Am I missing something? – Daniel McLaury May 11 '20 at 17:36
  • `std::atomic` makes an operation atomic. It doesn't make a transaction atomic. The classic mistake is you atomically check to see if a value exists, then atomically read the value. Unfortunately between the check and the read another thread comes along and consumes the value. Boom. – user4581301 May 11 '20 at 17:37
  • @user4581301: Yeah, I know that I can't just wrap the value with `std::atomic` and be done with it, but as I understand it I ought to be able to use `std::atomic` as a primitive to build a lockless grab-and-dereference. – Daniel McLaury May 11 '20 at 17:41
  • @DanielMcLaury You cannot read a pointer and the pointed-to data as a single atomic operation, since it involves multiple memory locations. – Daniel Langr May 11 '20 at 17:42
  • @DanielLangr: Not directly, no, but unless I'm mistaken I should be able to use `std::atomic` as a primitive to create an atomic transaction which does it. Although perhaps that ends up being worse than just locking a mutex, I don't know. – Daniel McLaury May 11 '20 at 17:43
  • @DanielMcLaury An _atomic_ transaction protected by a mutex (lock) internally? Yes, you can protect the whole vector/hash map with a mutex, but that would be extremely inefficient (recall that `std::atomic` does not guarantee lock-free operations). Why not simply use one of the existing solutions that have been developed and fine-tuned for a long time? Do you think you can do better? – Daniel Langr May 11 '20 at 17:46
  • Re, "...just use one of these [thread-safe] vectors for the buckets, and ..." I'm suspicious of how you said, "_just_." Do you understand why _just_ using thread-safe objects as components of a higher-level object will not automatically make the higher-level object thread-safe? IMO your explanation above is heavy on implementation ideas, and light on requirements. Before you choose or invent a "solution," be sure to understand all of the different operations that you need to perform on the map and what _guarantees_ you need from it (i.e., what does "thread-safe" mean for _your_ application?) – Solomon Slow May 11 '20 at 19:52
  • I can only agree with @SolomonSlow, you should be sure that you know what you are doing, because getting this stuff right is extremely tricky. I know what I am talking about - I am working with/on concurrent data structures for ~10 years now. ;) FWIW: I can refer you to my [xenium library](https://github.com/mpoeter/xenium) which provides two concurrent hash maps - `harris_michael_hash_map` which is fully lock-free, and `vyukov_hash_map` which is blazingly fast, even though it uses fine-grained locking for insert/remove (lookup is lock-free though). – mpoeter May 12 '20 at 08:49

1 Answer


The problem with writing thread-safe code is that it's very difficult to cover all the scenarios that can occur while code runs concurrently. Worst of all, home-grown thread-safe data structures may appear to work as expected and then fail at random in production.

Even more complex than lock-based algorithms are lock-free and wait-free algorithms. Lock-free algorithms guarantee that other threads can still make progress even if one thread is suspended; wait-free algorithms (which are also lock-free) additionally guarantee that every thread makes progress in a bounded number of steps.

In addition to the actual algorithm, you always have to think about the platform you're implementing the algorithm for. Multi-threaded code depends on the compiler and the memory model of the processor, particularly if locks are not used. std::atomic provides platform-independent access to atomic primitives required for lock-free/wait-free algorithms. That doesn't make writing correct custom thread-safe data structures much easier, though.

The short answer is: don't do it.

The long answer:

The most important point is the exact scenario you need the data structure for. Based on that, you can derive the requirements and assess whether it's feasible to implement it yourself. Experimenting makes sense if the goal is to understand the underlying mechanisms of such implementations; in production code, a home-grown structure usually comes back to haunt you and thus rarely wins.

As you cannot rely on behavior of the standard containers that isn't guaranteed by their interface contracts, it's difficult to use them as a basis for your own implementation. The documentation usually defines the expected behavior from a single-threaded point of view. For multi-threading, though, you would need to know the internals to be able to rely on them - unless, of course, the container was implemented with concurrency in mind.

Getting back to your scenario: let's say what you need is a hash table with a fixed number of buckets that can be read from without blocking. Inserts may be serialized, deletions are not required. This is very common for caches.

As a building block, all you need is a single lock and a fixed number of linked lists that represent the hash table buckets and handle collisions.

The lookup algorithm would be as follows (pseudo-code):

node* lookup(key) {
  // optimistic, lock-free read -- concurrency issue (see below)
  node = buckets[hash(key)].find(key);
  if (node) {
    return node;
  }
  lock();
  // re-check under the lock: another thread may have inserted the key
  // between the unlocked find and acquiring the lock
  node = buckets[hash(key)].find(key);
  if (node) {
    unlock();  // release the lock on this path, too
    return node;
  }
  node = new node(key);
  // publication order matters -- concurrency issue (see below)
  buckets[hash(key)].add(node);
  unlock();
  return node;
}

The hash table can be read without blocking, inserts are serialized. This only works if items are never removed from the buckets. Otherwise, you might access deallocated data.

There's a second caveat which is not immediately apparent and which illustrates the complexity of writing multi-threaded code: this only works as expected if the newly created node is fully constructed and visible to other threads before its pointer is inserted into the bucket. If that order is not maintained, readers might access a partially initialized node and crash, e.g., with a segmentation fault. The order is influenced by the compiler and the CPU, both of which are free to reorder instructions as long as the behavior of single-threaded code does not change.

In this specific case, the order is highly relevant. Consequently, we need to inform both the compiler and the CPU that the new must happen before the add. Likewise, the reader (find) needs to read the pointer before any of the node's data. This is achieved by controlling the memory order of both operations. In C++11, representing the node pointer as a std::atomic<node*> and using load and store to read/write it solves the issue, because the default memory order is std::memory_order_seq_cst, which guarantees sequential consistency. There's a more nuanced approach that may generate more efficient code: std::memory_order_acquire for the load and std::memory_order_release for the store. You can also control the order by placing memory barriers/fences explicitly (the memory-order arguments just mentioned trigger them implicitly).
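
To make that concrete, here's a minimal C++11 sketch of such a table (the names node, bucket, and fixed_map are made up for this illustration, and it assumes nodes are never deleted). find runs without a lock; the acquire load of the bucket head pairs with the release store in add, so a found node is always fully constructed:

#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>

struct node {
  std::string key;
  int value;
  node* next = nullptr;  // written once before publication, then immutable
};

struct bucket {
  std::atomic<node*> head{nullptr};

  // lock-free lookup: pairs with the release store in add()
  node* find(const std::string& key) const {
    for (node* n = head.load(std::memory_order_acquire); n; n = n->next)
      if (n->key == key) return n;
    return nullptr;
  }

  // must only be called while holding the writer lock
  void add(node* n) {
    n->next = head.load(std::memory_order_relaxed);
    head.store(n, std::memory_order_release);  // publish the finished node
  }
};

class fixed_map {
  static constexpr std::size_t N = 64;  // bucket count is fixed forever
  std::array<bucket, N> buckets_;
  std::mutex write_mutex_;

public:
  node* lookup(const std::string& key, int value_if_missing) {
    bucket& b = buckets_[std::hash<std::string>{}(key) % N];
    if (node* found = b.find(key)) return found;  // fast path: no lock
    std::lock_guard<std::mutex> guard(write_mutex_);
    if (node* found = b.find(key)) return found;  // re-check under the lock
    node* n = new node{key, value_if_missing};
    b.add(n);  // nodes are never removed, so readers stay safe
    return n;
  }
};

The double check after taking the lock prevents two threads that miss simultaneously from inserting the same key twice.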

The reason why purely lock-based algorithms usually do not have to deal with memory ordering is that the locking primitives already trigger memory barriers/fences implicitly with every lock and unlock.
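
For contrast, a purely lock-based variant of the same lookup needs no explicit memory orderings at all. A sketch, reusing the hypothetical fixed_map names from above (here find and add could operate on plain, non-atomic pointers):

// hypothetical member of the fixed_map sketch above
node* lookup_locked(const std::string& key, int value_if_missing) {
  // every lock()/unlock() of the mutex implies the required fences,
  // so plain pointer reads and writes are safe inside this section
  std::lock_guard<std::mutex> guard(write_mutex_);
  bucket& b = buckets_[std::hash<std::string>{}(key) % N];
  if (node* found = b.find(key)) return found;
  node* n = new node{key, value_if_missing};
  b.add(n);
  return n;
}

The price is that readers now block each other and the writer, which is exactly what the lock-free fast path above avoids.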

Long story short: if there's no need to create your own thread-safe data structures, don't do it; rely instead on existing implementations that have been reviewed and tested thoroughly.

horstr
  • Thanks, this gets at what I guess is the key point here, which is that there's no hope of using `std` containers as primitives given that their behavior across threads isn't specified by the standard. – Daniel McLaury May 12 '20 at 16:48