I plan on having a SQL fact table involving a text field which I don't expect to index on (I will only read out the data and very rarely update it). I think this table could get quite large, primarily due to this text field. The rest of the data in my database does make sense to be relational, however I believe I could scale much more easily and cheaply if I instead store pointers to flat files (where each pointer is to a different text file stored in something like S3) instead of using the text field.

An alternative that seems to be gaining popularity is a fully NoSQL document-based solution (e.g. CouchDB, MongoDB, etc.) I am wondering what are the tradeoffs (scalability/reliability/security/performance/ease of implementation/ease of maintenance/cost) between simply using a SQL text field, having a pointer to flat files, or completely rethinking the entire system in the context of a NoSQL document store?
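To make the "pointer to flat files" option concrete, here is a minimal sketch of what the fact table might look like (table and column names are hypothetical, and an in-memory SQLite database stands in for the real server):

```python
import sqlite3

# In-memory database as a stand-in for the real fact table (all names hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact (
        id         INTEGER PRIMARY KEY,
        created_at TEXT NOT NULL,
        metric     REAL NOT NULL,
        text_key   TEXT NOT NULL   -- e.g. an S3 object key, not the text itself
    )
""")
conn.execute(
    "INSERT INTO fact (created_at, metric, text_key) VALUES (?, ?, ?)",
    ("2011-12-05T08:45:00", 3.14, "facts/1.txt"),
)

# A SELECT returns only the small pointer; the large text is fetched
# separately from the object store.
row = conn.execute("SELECT text_key FROM fact WHERE id = 1").fetchone()
print(row[0])  # facts/1.txt
```

The row stays small regardless of how large the text grows, which is the whole point of the pointer approach.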

    This is a very complicated question to answer. The notion of *"quite large"* is very vague. Are you talking about Terabytes of data or Petabytes of data? What's the growth rate? What queries need to be fast, what can be acceptably slow? – Gates VP Dec 05 '11 at 08:45
  • This specific text data is expected to be on the order of 50 TB. It is expected to grow around the order of 500 kb per second during peak load. Ideally all select statements are fast (they will be predefined as only a web service will have access to the database), while insert and updates may be acceptably slow. – user1080972 Dec 05 '11 at 09:19
  • If you want to use MongoDB on a 32-bit system, the first thing to consider is that you can only store about 2GB of data. The MongoDB developers say the problem will resolve itself soon as most PCs will be 64-bit, so they do not want to change their program to allow 32-bit PCs to work with more than 2GB. At least that's what I read. So that's a first problem, but I think CouchDB does not have this problem. – aufziehvogel Jun 19 '12 at 19:54
  • MongoDB 32bit system support is only for dev. Production systems always run on 64bit systems, which has been standard for a long time. The reason behind the requirement is because MongoDB has been designed to utilize memory-mapped files. – Dylan Tong Jul 25 '13 at 00:10

1 Answer

The best approach is to use a relational database for the normal (non-text) data and to store the large text data "somewhere else" that handles large data better than a relational database can.

First, let's discuss why it's a bad idea to store large data in a relational database:

  • row sizes grow much larger, so the I/O needed to read the disk pages containing target rows balloons
  • backup sizes and, more importantly, backup times grow to the point where they can cripple DBA tasks and even bring systems offline (then backups get turned off, then the disk fails... oops)
  • you typically don't need to search the text, so there's no need to keep it in the database
  • relational databases and their libraries/drivers typically aren't good at handling unusually large values, and the mechanisms for doing so are often vendor-specific, making any solution non-portable

Your choice of "somewhere else" is broad, but includes:

  • NoSQL data stores like Cassandra or MongoDB
  • search engines like Lucene (if you ever do need to search the text)
  • the file system, or an object store like S3 (as you suggested)

Do what's easiest that will work - they are all valid as long as you do your requirements calculations for:

  • peak write performance
  • peak read performance
  • long-term storage volume
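Those calculations can be back-of-envelope. Using the numbers from the comments above (assuming "500 kb per second" means kilobytes, not kilobits):

```python
# Back-of-envelope sizing from the peak write rate quoted in the comments.
peak_write_bps = 500 * 1000        # ~500 kB/s, in bytes per second
per_day = peak_write_bps * 86_400  # bytes per day if peak were sustained
per_year = per_day * 365           # upper bound on yearly growth

print(f"{per_day / 1e9:.1f} GB/day, {per_year / 1e12:.1f} TB/year")
# 43.2 GB/day, 15.8 TB/year
```

An upper bound of roughly 16 TB/year of new text, on top of the existing ~50 TB, is well within what an object store handles comfortably, but it is a meaningful backup and row-size burden for a relational database.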

Another tip: Don't store anything about the text in the relational database. Instead, name/index the text using the id of the relational database row. That way, if you change your implementation, you don't have to re-jig your data model.
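That tip can be sketched as follows, with a temporary local directory standing in for an object store like S3 (all names are hypothetical):

```python
import pathlib
import sqlite3
import tempfile

# A local directory stands in for the object store (e.g. an S3 bucket).
store = pathlib.Path(tempfile.mkdtemp())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, metric REAL)")
cur = conn.execute("INSERT INTO fact (metric) VALUES (?)", (42.0,))
row_id = cur.lastrowid

# The text lives outside the database; its key is derived from the row id,
# so the relational schema stores nothing about the text at all.
(store / f"{row_id}.txt").write_text("the large text payload")

# Reading the text later needs only the id returned by a normal SELECT.
text = (store / f"{row_id}.txt").read_text()
print(text)  # the large text payload
```

Because the key is derived from the row id rather than stored in a column, swapping the file system for S3 (or anything else) later requires no schema change.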

Bohemian