I have a large, sparse, multidimensional lookup table whose cells contain arrays varying in size from 34 kB to circa 10 MB (essentially the one or more elements stored in that bin/bucket/cell). My prototype has 30**5 = 24,300,000 cells, of which only 4,568 are non-empty. These non-empty cells contain structured arrays with sizes between 34 kB and 7.5 MB. At 556 MB, the prototype easily fits in memory, but the production version will be much larger, perhaps 100–1000 times (it is hard to estimate). The growth will come mostly from increased dimensions rather than from the data contained in individual cells. My typical use case is write once (or rarely), read often.

  • I'm currently using a Python dictionary whose keys are tuples, e.g. db[(29,27,29,29,16)] is a structured numpy.ndarray of around 1 MB. However, as the table grows, it will no longer fit in memory.
  • A natural and easy to implement extension would be the Python shelve module.
  • I think PyTables (the tables module) is fast, in particular for the write once, read often use case, but I don't think it fits my data structure.
  • Since I will only ever need access by the tuple index, a very simple approach would be a directory containing some thousands of files with names like entry-29-27-29-29-16, each of which stores one numpy.ndarray object in some format (NetCDF, HDF5, npy...).
  • I'm not sure if a classical database would work, considering that the size of the entries varies considerably.
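To make the directory-of-files option concrete, here is a minimal sketch, assuming the npy format and one file per non-empty cell. The record layout CELL_DTYPE and the helper names cell_path, write_cell and read_cell are made up for illustration; the real structured dtype is not given above. Reading with mmap_mode="r" maps the file instead of loading it fully, so only the parts of a cell that are actually touched get read from disk.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical record layout; the real structured dtype is not given in the question.
CELL_DTYPE = np.dtype([("time", "f8"), ("value", "f4")])


def cell_path(root, key):
    """Map a tuple key such as (29, 27, 29, 29, 16) to root/entry-29-27-29-29-16.npy."""
    return Path(root) / ("entry-" + "-".join(str(i) for i in key) + ".npy")


def write_cell(root, key, array):
    """Store one non-empty cell; empty cells simply have no file."""
    np.save(cell_path(root, key), array)


def read_cell(root, key):
    """Return the cell memory-mapped read-only, or None if the cell is empty."""
    path = cell_path(root, key)
    if not path.exists():
        return None
    return np.load(path, mmap_mode="r")


# Small demonstration in a temporary directory.
root = tempfile.mkdtemp()
key = (29, 27, 29, 29, 16)
data = np.zeros(1000, dtype=CELL_DTYPE)
data["value"] = np.arange(1000)
write_cell(root, key, data)

cell = read_cell(root, key)              # memory-mapped structured array
missing = read_cell(root, (0, 0, 0, 0, 0))  # no file, so None
```

With this layout the sparsity costs nothing, since empty cells occupy no space. Whether a single flat directory remains practical at 100–1000 times as many files depends on the file system, so splitting the files over subdirectories keyed on the first index or two may be worth considering.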

How can I store a data structure like the one described above so that storage is efficient and data retrieval is fast?

gerrit

1 Answer

From what I understand, you might want to look at the amazing pandas package, as it has a specific facility for the sparse data structure you've described.

Also, while this Stack Overflow post doesn't specifically address sparse data, it gives a great description of using pandas for large data sets, which may be of interest.

Best of luck!

Andy Kubiak