SQL Database design for huge datasets

Question

I have a customer that has the following data structure... for each patient, there may be multiple samples, and each sample may, after processing, have 4 million data objects. The max number of samples per patient is 20. So a single patient may end up with 80 million rows of data, and of course there will be many many hundreds of patients eventually.

In setting up a database to store the objects (which each contain about 30 fields of statistics and measurements) the challenge is pretty clear- how to manage this vast amount of data?

I was thinking that I would have one database, with a table for each sample- so each table may have at most 4 million records.

A colleague of mine had an interesting suggestion which was to take it one step further- create a new database per patient and then have a table per sample. His thinking was that having 1 log per patient, being able to move databases on a per patient basis, etc was good. I can't disagree with him.

Is this reasonable? Is it a bad idea for some reason to have many databases?

Thoughts? Thank you!

What type of searching/querying are you going to be doing on this data? Relational database engines are built to handle millions and millions of rows in a single table... but if you are going to be pulling back 4 million rows every time you access the database, you might want to use an RDBMS... maybe something document oriented like MongoDB or Apache Cassandra? — bobwienholt, Sep 24 '11 at 22:55
@bobweinholt So the only thing I need to do with this data is run statistics on it. Most stats will be simple (mean, mode, stddev, etc.) some may be more advanced. So I was thinking SQL CLR or T-SQL. But I will -never- need to pull ALL the data back out. Its just to store for calculations. — Nicros, Sep 24 '11 at 23:02

score 2 · Accepted Answer · answered Sep 24 '11 at 22:52

2

While the idea is interesting from privacy and migration standpoint, it is NOT a good idea to have a single database per patient. Think about managing, backing up, having files for each patient database. I'm even not sure if DBMS can handle millions of databases at the same time in an instance or a server.

What I would do is, accept the volumetric data as facts of live and deal with it in the type of parameters and tables you choose. Let the DBMS worry about the schale of it. Make sure you have a deployment model allowing to scale-up and scale-out your tables. A table per entity, at least would be wise, so for patient, measurement, etc.

Just, do what you are good in as a developer and let the DBMS do what it is created for.

answered Sep 24 '11 at 22:52

kroonwijk

8,340
3
31
52

I don't think there would be millions of databases- just one per patient, and I think the patient load would be in the thousands at most. Still a lot tho. I think my colleague was thinking that managing those thousands of tables and a single massive log file would be more expensive for maintenance and moving than self-contained databases. – Nicros Sep 24 '11 at 23:08
Number of patients depends on the institution you want to serve. I was referring to university hospital like numbers. – kroonwijk Sep 24 '11 at 23:27

score 1 · Answer 2 · answered Sep 24 '11 at 22:54

1

When working with that much data, you will definitely want to explore MySQL and RDBMS alternatives. Have you looked into any noSQL solutions? (i.e. key value stores). There are several open source solutions, some of which would immediately not be right for this application given that any data loss is probably unacceptable.

Perhaps try looking at Apache's Cassandra http://cassandra.apache.org/. Its a distributed database system (key-value store), but can run on a single node as well. It would allow you to store all of your data for each patient under a single key value "i.e. Patient1" and then from there you could organize your data into whatever key-value structure is best for querying in your application.

answered Sep 24 '11 at 22:54

DJSunny

1,970
3
19
27

I will absolutely check out cassandra, sounds interesting! Oh, but can I run stats on columns of data? SQL has some of these things built in as I mentioned, and I will need them. Can cassandra do that? – Nicros Sep 24 '11 at 23:09
1

@Nicros Cassandra is a different animal from MySQL. You will have to change the way you think about the data itself, in terms of storing and querying. Its a little weird at first if you're very used to RDBMS, but it becomes obvious how powerful it can be. Here's a really great article that explains how Cassandras data model works point by point. You may have to read it more than once :). http://arin.s3.amazonaws.com/pub/docs/WTF-is-a-SuperColumn.pdf – DJSunny Sep 25 '11 at 16:17

SQL Database design for huge datasets

2 Answers2