As a research project I'm currently writing a document-oriented database from scratch in Python. Like MongoDB, the database supports the creation of indexes on arbitrary document keys. These indexes are currently implemented using two simple dictionaries: The first contains as key the (possibly hashed) value of the indexed field and as value the store keys of all documents associated with that field value, which allows the DB to locate the document on disk. The second dictionary contains the inverse of that, i.e. as a key the store key of a given document and as value the (hashed) value of the indexed field (which makes removing document from the index more efficient). An example:
doc1 = {'foo' : 'bar'} # store-key : doc1
doc2 = {'foo' : 'baz'} # store-key : doc2
doc3 = {'foo' : 'bar'} # store-key : doc3
For the foo
field, the index dictionaries for these documents would look like this:
foo_index = {'bar' : ['doc1','doc3'],'baz' : ['doc2']}
foo_reverse_index = {'doc1' : ['bar'],'doc2' : ['baz'], 'doc3' : ['bar']}
(please not that the reverse index also consists of lists of values [and not single values] to accommodate indexing of list fields, in which case each element of the list field would be contained in the index separately)
During normal operation, the index resides in memory and is updated in real time after each insert/update/delete operation. To persist it, it gets serialized (e.g. as JSON object) and stored to disk, which works reasonably well for index sizes up to a few 100k entries. However, as the database size grows the index loading times at program startup become problematic, and committing changes in realtime to disk becomes nearly impossible since writing of the index incurs a large overhead.
Hence I'm looking for an implementation of a persistent index which allows for efficient incremental updates, or, expressed differently, does not require rewriting the whole index when persisting it to disk. What would be a suitable strategy for approaching this problem? I thought about using a linked-list to implement an addressable storage space to which objects could be written but I'm not sure if this is the right approach.