8

The problem that prompts this question is related to creating ginormous inverted indices similar to those used when building IR systems. The general mantra from the IR community is that relational database are not suitable for building IR systems. Anyhow when looking at posgres the row tuple overhead is 23-bytes + padding (See "How much database disk space is required to store data from a typical text file?" in the Postgres FAQ). This is prohibitively large (non-scaling) for my work.

Incidentally my data set is 17 gigs of text requiring 4-5 tables depending on how the problem is sliced. I remember trying out a schema in sqlite and the db file breaking 100 gigs

I'm quite interested in knowing what the per row overheads are for Sql Server / MySql / Sqlite / Berkeley db (all of its access methods) / Berkley Db's sqlite3 interface / Kyoto , Tokyo db and Firebird. Any one person won't be able to answer the question I guess, unless someone has been as curious as me to look into the matter.

edit

  • Postgres - 23(OMG !) byte tuple header + padding.
  • bdb-hash : 26-byte page overhead, 6-byte key/data overhead (combined).
  • Bdb-btree : 26-byte page overhead, 10-byte key/data overhead (combined).
  • MySql Innodb : analysed here (5-byte header + transaction id + roll pointer =18 per row afaik) note-to-self: why does a transaction id appear on disk ? What are roll pointers ?
  • Sql Server : from here. Captures variant element lengths, rows with static data types incur a very modest overhead. overhead estimation largely depends on the nature of the schema and the data. Overhead increases the larger a variant element is.
Community
  • 1
  • 1
Hassan Syed
  • 20,075
  • 11
  • 87
  • 171
  • Enlighten me - what does IR systems mean? – Doc Brown Jan 02 '12 at 14:24
  • http://en.wikipedia.org/wiki/Information_retrieval – Hassan Syed Jan 02 '12 at 14:24
  • not really.... that's like saying attending a lecture at university is poor attitude (which I think it is, I never attended lectures, but the rest of the world would disagree). This place is for digested information that is spoon fed. Indintally I code 5-6 hours a day, 3-4 hours of research. If I followed every rabit hole I would not get any work done. – Hassan Syed Jan 02 '12 at 14:27
  • No, it's like going to university, and expecting someone else to give you their notes for the term! – Mitch Wheat Jan 02 '12 at 14:28
  • Well I am the sort of person that doesn't get bitter at people taking short cuts. Those are the sort of people that are most productive. Stand on the shoulders of Giants remember ? – Hassan Syed Jan 02 '12 at 14:32
  • The "study and understand" part is conveyed via publication. Publications contain condensed sumarized information, so as not to waste the time of peers and to pre-emptively close avenues of investigation. Thus allowing a non-expert peer to engage with the field without overheads allready incurred by others. Get a clue man. – Hassan Syed Jan 03 '12 at 16:39
  • That's an interesting suggestion coming from someone who claims that (or at least seems to subscribe the notion that) "relational database are not suitable for building Information Retrieval systems". If not for retrieving information, then what else have they ever been used for ? – Erwin Smout Jan 04 '12 at 00:17
  • That's right, go off topic. Either way you have no clue what you are talking about. Any IR researcher will tell you that a standard Enterprise DB is non-scalable for Internet-scale problems. Do some reading before you troll. – Hassan Syed Jan 04 '12 at 11:41
  • The *"Standing on the shoulders of giants"* was part of a remark made by Newton (and previously by some others). Publications at that time might contain condensed summarized information but it was not for the eyes of the non-experts. Just try to read an ancient math paper in Greek. – ypercubeᵀᴹ Jan 04 '12 at 13:06
  • @ypercube yes that is correct. But I didn't really mean a non-expert in that extreme. More along the lines of say a physicist reading pure Maths, or a algorithm designer/modeller looking into database design :). – Hassan Syed Jan 04 '12 at 14:57
  • I found this table is hugely useful, I would actually be interested in answers, and it is a real question. – Joseph Garvin Mar 17 '17 at 21:00

0 Answers0