
I need an on-disk key-value store, not too big or distributed. The use case is as follows:

  • The full DB will be a few GB in size
  • Both keys and values are of constant size
  • It's a constant database. Once the entire database is written I don't need to write any more entries (or only very infrequently)
  • Keys will be accessed in unpredictable order
  • Supporting concurrent reads by multiple processes is a must.
  • It has to be very fast, because the readers will be accessing millions of keys in a tight loop, so it should be as close as possible in performance to looping over an in-memory associative array (say, STL's std::map)
  • Ideally it should allow one to set how much RAM to use; typically it should use a few hundred MB
  • Written in C or C++. An existing Python extension will be a big plus, but I can add that on my own

So cdb and gdbm look like good choices, but I just wanted to know if there are more suitable options. Pointers to relevant benchmarks, or even anecdotal evidence, would be appreciated.

san

3 Answers


What database did you end up using?

If you like cdb and you need a database larger than 4 GB, please have a look at mcdb, which is based on cdb, with some performance enhancements and added support for constant databases over 4 GB.

https://github.com/gstrauss/mcdb/

Python, Perl, Lua, and Ruby extensions are provided. mcdb is written in C and uses mmap under the hood, so it easily supports lock-free concurrent reads between threads and between processes. Since it is backed by a memory-mapped file, pages are mapped in from disk as needed, and memory use stays effectively constant even as the number of processes accessing the database increases.

gstrauss

Have you looked at Berkeley DB (BDB)? It sounds like a good fit for your use case.

hsanders
  • Yes. I meant gdbm to stand in for the dbm and bdb implementations – san Jun 22 '12 at 03:02
  • Ah, OK. I figured you just meant gdbm by that. There's of course the problem with gdbm that it's GPL, so it could force your software to be GPL. – hsanders Jun 22 '12 at 14:15
  • My bad, I should have made that clear. GPL isn't a problem for me at all. If they are giving something away for free, which they aren't obligated to, they are entitled to whatever terms they want. – san Jun 22 '12 at 17:58
  • I don't really have a problem with it, in fact I like what they do, it just CAN be a problem if your program isn't intended to be GPL or compatible. The powers that be sometimes do not like that. – hsanders Jun 22 '12 at 18:23

I like hamsterdb because I wrote it :)

http://www.hamsterdb.com

  • frequently used with database sizes of several GB
  • keys/values can have any size you wish
  • random access + directional access (with cursors)
  • concurrent reads: hamsterdb is thread safe, but not yet concurrent. I'm working on this.
  • if your cache is big enough then access will be very fast (you can specify the cache size)
  • written in C++
  • a Python extension is available, but it's terribly outdated and will need fixes

If you want to evaluate hamsterdb and need some help, feel free to drop me a mail.

cruppstahl
  • 1
    looks interesting. Have you benchmarked it against the db, dbm family for read speeds. I dont need multi-threading though, only the ability to open it for reads from multiple process. The latter is mandatory. – san Jun 22 '12 at 18:04
  • 1
    i have benchmarked against berkeleydb, and in most of the benchmarks hamsterdb is much faster and scales better. (in a few others bdb is a bit faster). I have not tested against gdbm. Btw - thread safety is for multiple threads, not multiple processes. hamsterdb locks the file exclusively, and there's NO single writer/multiple reader support. For this i have an embedded http-based client/server model. Sorry - i missed that requirement when reading your original post. – cruppstahl Jun 22 '12 at 21:42
  • Oh, then it's a bit of a bummer, because I cannot afford the HTTP overhead. Thanks for getting back. – san Jun 23 '12 at 00:02
  • In that case I think only BerkeleyDB offers what you want. – cruppstahl Jun 27 '12 at 20:36