My company will be coming into possession of a dataset of approximately 200-300 million records. The source material is CSV, roughly 150 GB uncompressed. We would need to perform an initial load of the data and then update approximately 1% of the records (2-3 million rows) on a daily basis. We would also love to keep the history of each record.
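To make the daily-update part concrete, here is a rough sketch of the kind of job I have in mind against our current MySQL setup (the `records`/`records_history` tables, column names, connection details, and the changed-rows CSV layout are all placeholders): each changed row's current version gets copied into a history table before the main row is upserted. The initial 150 GB load itself would presumably be done with LOAD DATA INFILE rather than row-by-row inserts.

```python
# Sketch only: apply a daily CSV of changed records to MySQL in batches,
# snapshotting the existing version of each record into a history table first.
# Table names, columns, and credentials below are hypothetical.
import csv
import mysql.connector

BATCH_SIZE = 10_000  # keep transactions small so 2-3M daily changes stay manageable

conn = mysql.connector.connect(
    host="localhost", user="etl", password="secret", database="warehouse"
)
cur = conn.cursor()

def apply_batch(rows):
    ids = [r["record_id"] for r in rows]
    placeholders = ", ".join(["%s"] * len(ids))
    # 1. Copy the rows we are about to change into the history table.
    cur.execute(
        "INSERT INTO records_history (record_id, payload, archived_at) "
        "SELECT record_id, payload, NOW() FROM records "
        "WHERE record_id IN (" + placeholders + ")",
        ids,
    )
    # 2. Upsert the new versions of those rows.
    cur.executemany(
        "INSERT INTO records (record_id, payload) VALUES (%s, %s) "
        "ON DUPLICATE KEY UPDATE payload = VALUES(payload)",
        [(r["record_id"], r["payload"]) for r in rows],
    )
    conn.commit()

with open("daily_changes.csv", newline="") as f:
    batch = []
    for row in csv.DictReader(f):
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            apply_batch(batch)
            batch = []
    if batch:
        apply_batch(batch)
```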
We currently use MySQL, and it appears that some people do run MySQL or PostgreSQL at this scale, but I haven't found much hard information about their experiences.
We could definitely get away without normalizing the data, and I can envision distributing the information over a number of servers. Would MongoDB or some other non-traditional datastore be worth considering?
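Along those lines, here is a minimal sketch of how the denormalized route might look in MongoDB (database, collection, and field names are made up): one document per record, keyed by the record ID, with each new version also pushed onto an embedded history array.

```python
# Sketch only: upsert changed records into MongoDB while keeping every
# version of each record on the document itself. Names are hypothetical.
from datetime import datetime, timezone

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
records = client["warehouse"]["records"]

def apply_changes(changes):
    """changes: iterable of dicts, each with a 'record_id' plus the new field values."""
    now = datetime.now(timezone.utc)
    ops = []
    for change in changes:
        rid = change.pop("record_id")
        ops.append(UpdateOne(
            {"_id": rid},
            {
                # Overwrite the current version of the record...
                "$set": {"current": change},
                # ...and append this version to an embedded history array,
                # so the record carries its own change history.
                "$push": {"history": {"changed_at": now, "doc": change}},
            },
            upsert=True,
        ))
    if ops:
        records.bulk_write(ops, ordered=False)
```

My assumption is that for records that change very often the embedded history array could eventually run into MongoDB's 16 MB document limit, in which case old versions would go into a separate history collection instead, and sharding on the record ID would spread the data across servers.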
Does anyone have any thoughts on the feasibility of this sort of endeavor? I appreciate any help you might be able to give.