TL;DR
How can you find "unreachable keys" in a key/value store with a large amount of data?
Background
In comparison to relational databases, which provide ACID guarantees, NoSQL key/value databases provide fewer guarantees in order to handle "big data". For example, they typically provide atomicity only in the context of a single key/value pair, but they use techniques like distributed hash tables to "shard" the data across an arbitrarily large cluster of machines.
Keys are often unfriendly for humans. For example, the key for a blob of data representing an employee might be `Employee:39045e87-6c00-47a4-a683-7aba4354c44a`. The employee might also have a more human-friendly identifier, such as the username `jdoe` with which the employee signs in to the system. This username would be stored as a separate key/value pair, where the key might be `EmployeeUsername:jdoe`. The value for the key `EmployeeUsername:jdoe` is typically either an array of strings containing the main key (think of it as a secondary index, which does not necessarily contain unique values) or a denormalised version of the employee blob (perhaps aggregating data from other objects in order to improve query performance).
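To make that layout concrete, here is roughly what the two pairs might look like, with the key space modelled as a plain Python dict (the value shapes and field names are purely illustrative):

```python
# Illustrative only: the key space modelled as a Python dict, showing the
# primary record and the manual secondary-index entry described above.
store = {
    # Machine-friendly primary key -> serialized employee blob
    "Employee:39045e87-6c00-47a4-a683-7aba4354c44a": {
        "username": "jdoe",
        "name": "Jane Doe",  # hypothetical field
    },
    # Human-friendly key -> array of primary keys (not necessarily unique)
    "EmployeeUsername:jdoe": [
        "Employee:39045e87-6c00-47a4-a683-7aba4354c44a",
    ],
}
```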
Problem
Now, given that key/value databases do not usually provide transactional guarantees, what happens when a process inserts the key `Employee:39045e87-6c00-47a4-a683-7aba4354c44a` (along with the serialized representation of the employee) but crashes before inserting the `EmployeeUsername:jdoe` key? The client does not know the key for the employee data - he or she only knows the username `jdoe` - so how do you find the `Employee:39045e87-6c00-47a4-a683-7aba4354c44a` key?
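A sketch of the write sequence makes the crash window explicit. The store and its put method are hypothetical stand-ins, not any particular product's API:

```python
# The two writes are independent puts, so there is no atomicity across them.
class InMemoryStore:
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

def register_employee(store, employee_id, username, blob):
    primary_key = "Employee:" + employee_id
    index_key = "EmployeeUsername:" + username

    store.put(primary_key, blob)         # write 1: the employee record
    # <- a crash here leaves the record stored but unreachable by username
    store.put(index_key, [primary_key])  # write 2: the manual secondary index
```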
The only thing I can think of is to enumerate the keys in the key/value store and, once you find the appropriate key, "resume" the indexing/denormalisation. I'm well aware of techniques like event sourcing, where an idempotent event handler could respond to an event (e.g., `EmployeeRegistered`) in order to recreate the username-to-employee-uuid secondary index, but using event sourcing over a key/value store still requires enumerating the keys, which could degrade performance.
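For what it's worth, here is a sketch of that brute-force repair, again modelling the whole key space as a plain dict and assuming both that every key can be enumerated and that the employee blob carries its own username (assumptions, not features of any particular store):

```python
# Enumerate all keys, work out which employee records are already referenced
# by an index entry, and "resume" the indexing step for the orphans.
def rebuild_username_index(store):
    # First pass: primary keys that some index entry already points at.
    indexed = set()
    for key, value in store.items():
        if key.startswith("EmployeeUsername:"):
            indexed.update(value)

    # Second pass: any employee record not referenced above is an orphan.
    for key, value in list(store.items()):
        if key.startswith("Employee:") and key not in indexed:
            index_key = "EmployeeUsername:" + value["username"]
            store[index_key] = store.get(index_key, []) + [key]
```

On a large cluster this amounts to a full scan of the key space, which is exactly the performance concern above.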
Analogy
The more experience I have in IT, the more I see the same problems being tackled in different scenarios. For example, Linux filesystems store both file and directory contents in "inodes". You can think of these as key/value pairs, where the key is an integer and the value is the file/directory contents. When writing a new file, the system creates an inode and fills it with data, and only then modifies the parent directory to add the "filename-to-inode" mapping. If the system crashes after creating the file but before referencing it in the parent directory, the file "exists on disk" but is essentially unreachable. When the system comes back online, hopefully it will place this file into the "lost+found" directory (I imagine it does this by scanning the entire disk). There are plenty of other examples (such as domain-name-to-IP-address mappings in DNS), but I specifically want to know how the above problem is tackled in NoSQL key/value databases.
EDIT
I found this interesting article on manual secondary indexes, but it doesn't address "broken" or "dated" secondary indexes.