0

We have a batch analytical SQL job – run once daily – that reads data from 2 source tables held in a powerful RDBMS. The source tables are huge (>100TB) but has less than 10 fields combined.

The question I have is can the 2 source tables be held in a compressed and indexed flat file so the entire operation can be much faster and saves on storage and can be run on a low spec server. Also, can we run SQL like queries against these compressed and indexed flat-files? Any pointers on how to go about doing this would be extremely helpful.

  • SQLite uses a single portable flat file to store the objects of a single database. It supports indexes and a SQL interface. But the maximum database size allowd is only 14TB. – Raihan Nov 25 '11 at 08:13
  • You could use JSON or YAML based text files to store your tables. The maximum file size can be as large as your OS allows. There are Java/Ruby and many other programming interfaces for both JSON and YAML. But I don't think any of the impementations support indexing yet. And there is no SQL interface. – Raihan Nov 25 '11 at 08:36

1 Answers1

1

Most optimization strategies optimize either speed or size, and trade one off against the other. In general, RDBMS solutions optimize for speed, at the expense of size - for instance, by creating an index, you take up more space, and in return you get faster data access.

So your desire to optimize for both speed AND size is unlikely to be fulfilled - you almost certainly have to trade one against the other.

Secondly, if you want to execute "sql-like" queries, I'm pretty sure that an RDBMS is the best solution - especially with huge data sets.

It may be the case that the underlying data lends itself to a specific optimization - for instance, if you can create a custom indexing scheme based on bitmasks to create integers, and using those integers to access data using boolean operators, you may be able to beat the performance of an RDBMS index.

Neville Kuyt
  • 29,247
  • 1
  • 37
  • 52
  • Thanks Neville. "sql like" queries is to assess if we can re-use our existing SQL jobs against an indexed flat-file but not too important. The main point is how should one go about creating such high performance indexed and compressed flat-files. – user1065024 Nov 25 '11 at 12:16