The problem that prompts this question is related to creating ginormous inverted indices, similar to those used when building IR systems. The general mantra from the IR community is that relational databases are not suitable for building IR systems. Anyhow, when looking at Postgres, the per-row tuple overhead is 23 bytes + padding (see "How much database disk space is required to store data from a typical text file?" in the Postgres FAQ). That is prohibitively large (it simply doesn't scale) for my work.
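To sanity-check that figure empirically, here's a minimal sketch (assuming a locally reachable Postgres and psycopg2 installed; the connection string and the `overhead_probe` table name are just placeholders, not part of my actual schema) that inserts rows carrying a 4-byte integer and divides the heap size by the row count:

```
# Minimal sketch: estimate Postgres per-row cost empirically.
# Assumes a locally reachable Postgres and psycopg2; the connection
# string and table name are placeholders for illustration only.
import psycopg2

N = 100_000
conn = psycopg2.connect("dbname=postgres")  # adjust credentials as needed
conn.autocommit = True
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS overhead_probe")
cur.execute("CREATE TABLE overhead_probe (v int4)")  # 4-byte payload per row
cur.execute("INSERT INTO overhead_probe SELECT g FROM generate_series(1, %s) g", (N,))
cur.execute("SELECT pg_relation_size('overhead_probe')")  # heap size in bytes, indexes excluded
size = cur.fetchone()[0]

# Expect roughly 32+ bytes per row: 23-byte header padded to 24, 4-byte payload,
# 4-byte line pointer, plus amortized page headers -- versus 4 bytes of actual data.
print(f"{size / N:.1f} bytes per row for a 4-byte integer payload")
```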
Incidentally, my data set is 17 gigs of text, requiring 4-5 tables depending on how the problem is sliced. I remember trying out a schema in SQLite and the db file breaking 100 gigs.
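For SQLite the same kind of back-of-the-envelope check runs with nothing but the standard library; a rough sketch (the file and table names are made up for illustration):

```
# Minimal sketch: measure SQLite's on-disk cost per small row.
# Standard library only; file and table names are illustrative.
import os
import sqlite3
import tempfile

N = 100_000
path = os.path.join(tempfile.mkdtemp(), "probe.db")

con = sqlite3.connect(path)
con.execute("CREATE TABLE postings (term_id INTEGER, doc_id INTEGER)")
con.executemany("INSERT INTO postings VALUES (?, ?)",
                ((i, i * 7) for i in range(N)))
con.commit()
con.close()

size = os.path.getsize(path)
# The few bytes of integer payload per row get wrapped in record headers,
# cell pointers, rowids and B-tree page structure, so bytes/row lands well
# above the raw payload size.
print(f"{size / N:.1f} bytes per row on disk")
```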
I'm quite interested in knowing what the per-row overheads are for SQL Server / MySQL / SQLite / Berkeley DB (all of its access methods) / Berkeley DB's SQLite3 interface / Kyoto Cabinet, Tokyo Cabinet and Firebird. I guess no single person will be able to answer the whole question, unless someone has been as curious as me and looked into the matter.
edit - what I've dug up so far (a rough scale calculation follows the list):
- Postgres: 23-byte (OMG!) tuple header + padding.
- BDB hash: 26-byte page overhead, 6-byte key/data overhead (combined).
- BDB btree: 26-byte page overhead, 10-byte key/data overhead (combined).
- MySQL InnoDB: analysed here (5-byte record header + 6-byte transaction ID + 7-byte roll pointer = 18 bytes per row, AFAIK). Note to self: why does a transaction ID appear on disk? What are roll pointers?
- SQL Server: from here. The row format records the lengths of variable-length columns, so rows with only fixed-length data types incur a very modest overhead. The overhead estimate largely depends on the nature of the schema and the data, and it increases the larger a variable-length value is.
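To put those per-row figures in perspective at inverted-index scale, a quick back-of-the-envelope calculation (the one-billion-postings row count is an assumed illustration, not my actual data set):

```
# Back-of-the-envelope: fixed per-row overhead alone, at inverted-index scale.
# The row count is an assumed illustration; the per-row figures are the ones
# listed above.
ROWS = 1_000_000_000  # e.g. one billion (term_id, doc_id) postings

overheads = {
    "Postgres (tuple header, before padding)": 23,
    "BDB hash (key/data, combined)": 6,
    "BDB btree (key/data, combined)": 10,
    "InnoDB (header + trx id + roll ptr)": 18,
}

for engine, per_row in overheads.items():
    print(f"{engine}: {per_row * ROWS / 2**30:.1f} GiB of pure overhead")
```

That's overhead before storing a single byte of actual posting data, which is why the 23-byte Postgres header looks like a non-starter for this workload.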