When using a DBM database (e.g. Berkeley or GDBM), is it better to store data using fewer long strings or more short strings? I can easily structure my data either way. I'm looking for 'better' in the performance sense, but I'm interested in other implications as well.
Is there any reason why you are using dbm instead of a better database, such as MySQL? – ghostdog74 Jul 15 '09 at 15:19
They're not 'better', just 'bigger'. – Javier Jul 15 '09 at 15:20
Optimal string length will depend on your application. Will you be searching the data frequently? Which elements will your search be based on? Which fields will be frequently modified? – Daniel F. Thornton Jul 15 '09 at 15:22
Python allows the use of dbm databases on practically any system, without requiring the user to install anything. Also, MySQL is too heavy for what I require. – Jul 15 '09 at 15:24
Lots of accesses, quickly, and usually reads. Writes are infrequent, so I don't care about write performance. Since there is only one 'key' in dbm, I can only search over one element... – Jul 15 '09 at 15:26
In that case, there will likely not be as great a disparity in performance (between long and short strings), though I would still opt for a greater quantity of shorter elements. – Daniel F. Thornton Jul 15 '09 at 15:30
3 Answers
Berkeley DB, or any other DBM implementation, will incur overhead for each key/value pair. If you're dealing with millions of k/v pairs the overhead will matter, otherwise it's noise and you should choose what is easiest for you the programmer and let the database deal with the data. Overhead and access time will also depend on access method. Hash tables and B-Trees are totally different algorithmic animals. If your data has any degree of key ordering or access patterns dependent on keys then 99% of the time B-Trees are the way to go.
I think you're asking a great design question, but for anyone to give you a perfect answer we'd all have to know a lot more about the amount of data you're dealing with, your access patterns, and many other factors.
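To make the trade-off concrete: with Python's dbm module (which the question is using), the two layouts might look like the following. This is a minimal sketch; the key names and the delimited record format are made up for illustration.

```python
import dbm
import os
import tempfile

tmp = tempfile.mkdtemp()

# Option A: fewer, longer strings -- one key per record,
# with the fields packed into a single delimited value.
with dbm.open(os.path.join(tmp, "packed"), "c") as db:
    db[b"user:42"] = b"alice|alice@example.com|2009-07-15"

# Option B: more, shorter strings -- one key per field.
# This pays the per-pair overhead three times, but each field
# can be read or rewritten without touching the others.
with dbm.open(os.path.join(tmp, "perfield"), "c") as db:
    db[b"user:42:name"] = b"alice"
    db[b"user:42:email"] = b"alice@example.com"
    db[b"user:42:joined"] = b"2009-07-15"
```

With millions of records, Option B's extra key/value pairs are exactly where the per-pair overhead described above starts to matter.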

If you will be frequently searching or modifying the data, a greater number of short strings will provide better performance.
That is, you don't want to be searching for a substring of one of those long strings, or frequently modifying a value in the middle of a string.
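To illustrate the modification cost: with a packed value you must read, parse, and rewrite the whole string to change one field, while per-field keys make the same update a single replacement. A rough sketch using the stdlib dbm module (key names and record format are hypothetical):

```python
import dbm
import os
import tempfile

tmp = tempfile.mkdtemp()

# Packed layout: changing one field means parsing and
# rewriting the entire value.
with dbm.open(os.path.join(tmp, "packed"), "c") as db:
    db[b"user:42"] = b"alice|alice@example.com|2009-07-15"
    fields = db[b"user:42"].split(b"|")
    fields[1] = b"new@example.com"          # parse, modify, re-serialize
    db[b"user:42"] = b"|".join(fields)

# Per-field layout: the same update is one key replacement.
with dbm.open(os.path.join(tmp, "perfield"), "c") as db:
    db[b"user:42:email"] = b"alice@example.com"
    db[b"user:42:email"] = b"new@example.com"
```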

I think this question is really hard to answer in a completely generic way. There are so many variables here, that you would really need to test some common scenarios to determine the answer that is best for you.
Some factors to consider:
- Will larger strings require substring searches?
- What kind of searches will you perform over the data?
In the end, it's generally better to go with the approach that yields the most normalized schema. Optimization can start from there, and depending upon your db, there are probably better alternatives than restructuring the underlying schema purely for performance.
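Since the answer depends so heavily on your workload, the practical move is to measure your own common scenario. A rough micro-benchmark sketch with the stdlib dbm and timeit modules, timing reads of one field under both layouts (key names, record count, and record format are made up):

```python
import dbm
import os
import tempfile
import timeit

tmp = tempfile.mkdtemp()
N = 1000  # record count is arbitrary; scale toward your real data size

# Populate both layouts with N synthetic records.
with dbm.open(os.path.join(tmp, "packed"), "c") as db:
    for i in range(N):
        db[f"user:{i}".encode()] = b"name|user@example.com|2009-07-15"

with dbm.open(os.path.join(tmp, "perfield"), "c") as db:
    for i in range(N):
        db[f"user:{i}:email".encode()] = b"user@example.com"

def read_packed():
    # Fetch the long value, then parse out the one field we want.
    with dbm.open(os.path.join(tmp, "packed"), "r") as db:
        for i in range(N):
            _ = db[f"user:{i}".encode()].split(b"|")[1]

def read_perfield():
    # Fetch exactly the field we want, nothing else.
    with dbm.open(os.path.join(tmp, "perfield"), "r") as db:
        for i in range(N):
            _ = db[f"user:{i}:email".encode()]

print("packed:   ", timeit.timeit(read_packed, number=10))
print("per-field:", timeit.timeit(read_perfield, number=10))
```

Which layout wins will vary with the dbm backend Python selects on your system (gdbm, ndbm, or dumb), which is another reason to benchmark rather than guess.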
