I have a large, sparse, multidimensional lookup table, where cells contain arrays varying in size from 34 kB to circa 10 MB (essentially one or more elements stored in that bin/bucket/cell). My prototype has 30**5 = 24,300,000 cells, of which only 4,568 are non-empty (so it's sparse); those non-empty cells contain structured arrays between 34 kB and 7.5 MB in size. At 556 MB, the prototype is easily small enough to fit in memory, but the production version will be a lot larger, perhaps 100–1000 times (it is hard to estimate). The growth will come mostly from increased dimensions rather than from the data contained in individual cells. My typical use case is write once (or rarely), read often.
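For concreteness, here is a minimal sketch of one non-empty cell in the current prototype; the field names and record count are invented for illustration, chosen only so the entry comes out around 1 MB:

```python
import numpy as np

# A cell is a structured array; this dtype and record count are made up.
record_dtype = np.dtype([("x", np.float64), ("y", np.float64), ("weight", np.float64)])

db = {}                                        # the current prototype: a plain dict
db[(29, 27, 29, 29, 16)] = np.zeros(40_000, dtype=record_dtype)

entry = db[(29, 27, 29, 29, 16)]               # lookup is always by the tuple index
print(entry.nbytes / 1e6, "MB")                # ~0.96 MB
```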
- I'm currently using a Python dictionary whose keys are tuples, i.e. `db[(29,27,29,29,16)]` is a structured `numpy.ndarray` of around 1 MB, as in the sketch above. However, as it grows, this won't fit in memory.
- A natural and easy-to-implement extension would be the Python `shelve` module (a minimal sketch follows this list).
- I think `tables` is fast, in particular for the write-once, read-often use case, but I don't think it fits my data structure.
- Considering that I will always need access only by the tuple index, a very simple way to store the table would be a directory with some thousands of files with names like `entry-29-27-29-29-16`, each storing its `numpy.ndarray` in some format (NetCDF, HDF5, npy, ...); see the second sketch after this list.
- I'm not sure whether a classical database would work, considering that the size of the entries varies considerably.
What is a good way to store a data structure like the one described above, so that storage is efficient and retrieval of individual entries is fast?