
I've completed an application and have been researching hosting environments for its deployment. The application is fairly query-heavy: most pages run several queries with several joins, and there are triggers on most of the tables. As long as the database has enough RAM for its buffer pool I'm guessing performance should be fine, so if I go with a VPS host like Linode I can just keep upgrading my server so the database has enough RAM. My concern is what happens when I can't get more RAM: how much does performance suffer when a database doesn't have enough RAM? Should I treat the shrinking amount of free memory as a ticking time bomb? Does the DBMS change its caching techniques to avoid disk accesses whenever possible? Essentially I want to know how smart DBMSes are and how they cope before sharding or replication is needed.

el_pup_le

3 Answers


Programs, in general, are exactly as smart as they are programmed. DBMSes are programs. So without knowing what DBMS you're using, it's impossible to say, in general, what will happen. So, the only correct answer to your question is a close vote as "not a real question" (which I note someone has already done). However, I've got some spare time, so I'll write up a general screed on database scaling and performance, in the hope it might answer the question you should be asking.

Since you're using the not-really-hip-any-more term "DBMS", I'll assume you're using a not-really-hip-any-more relational database, though, and there things get more complicated. The engines I'm familiar with (MySQL and PostgreSQL) both have a zillion knobs to tell the system how much RAM to use -- caches of various things, working set memory, buffers... it's all a lot of fun. Tuning them appropriately for the workload and available system resources is mostly (though not entirely) about reducing disk I/O, since that is usually (though, again, not always) the slowest and most-likely-to-cause-a-bottleneck component in the physical system.
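As a rough illustration of what those knobs look like on the PostgreSQL side (this is only a sketch: the connection string is a placeholder and the knob list is far from exhaustive):

```python
# Rough sketch: print a few PostgreSQL memory-related settings.
# Assumes psycopg2 is installed and "dbname=myapp" is a placeholder DSN
# pointing at your database.
import psycopg2

KNOBS = ["shared_buffers", "work_mem", "effective_cache_size"]

conn = psycopg2.connect("dbname=myapp")  # placeholder DSN
cur = conn.cursor()
for knob in KNOBS:
    # SHOW takes an identifier, not a bind parameter, so interpolate directly.
    cur.execute("SHOW {}".format(knob))
    print("{} = {}".format(knob, cur.fetchone()[0]))
cur.close()
conn.close()
```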

Thus, when you can't increase RAM any further, your performance will start to suffer (hopefully gradually) as more and more queries need disk accesses to complete. The bigger the database gets, and the worse your disk I/O performance is, the more pronounced that degradation becomes.
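One way to watch for that, rather than staring at free memory, is to track how often the database finds pages in its cache versus going to disk. In PostgreSQL that's exposed in pg_stat_database (blks_hit vs blks_read); InnoDB has comparable counters (Innodb_buffer_pool_read_requests vs Innodb_buffer_pool_reads). A rough sketch, again with a placeholder connection string:

```python
# Rough sketch: compute the PostgreSQL buffer cache hit ratio, i.e. how often
# queries are served from RAM rather than disk. A ratio that keeps falling as
# the data grows is the signal that you're outgrowing your memory.
import psycopg2

conn = psycopg2.connect("dbname=myapp")  # placeholder DSN
cur = conn.cursor()
cur.execute("SELECT sum(blks_hit), sum(blks_read) FROM pg_stat_database")
hits, reads = cur.fetchone()
hits, reads = float(hits or 0), float(reads or 0)
if hits + reads > 0:
    print("buffer cache hit ratio: {:.1%}".format(hits / (hits + reads)))
else:
    print("no block I/O recorded yet")
cur.close()
conn.close()
```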

Given how hard it is to horizontally scale a relational database (it's not impossible, but it's a whole hell of a lot harder than horizontally scaling frontends), if you're intending on doing things at scale, you need a provider that can give you big machines -- lots of RAM, but also lots of CPU, disk space and IOPS. Linode's largest VM appears to be 20GB, which is way too small. AWS has instances with up to 70GB or so of RAM, which is better, but compared to a physical machine with a TB (or more) of RAM... it's still not real clever.

It's not that a VM is always wrong for a database server, but at some point, when you outgrow the available VM options, you need to know what you're going to do next. It's increasingly common for people to go down the "shard early, shard often" path, because if you're going for massive scale there isn't a physical machine on earth that'll save you, and it means you can run on whatever dinky-toy cloud you like. Sharding is a lot of work to do right, though, and limits your options somewhat in the way you model and interact with your data, so I like to avoid it if I can. The thing is, physical hardware moves at a fairly steady clip, and there's a lot of headroom already available for you to grow into, so by the time you've got a database that needs 2TB of RAM and 30TB of storage (roughly the largest spec single physical machine I can buy at the moment), technology is likely to have improved to the point where a machine with 4TB of RAM and 100TB of storage costs less than what you paid for that 2TB monster.
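As a side note on why sharding "limits your options": here's a toy sketch (all names invented) of hash-based shard routing. Anything keyed by the shard key goes to exactly one node; everything else has to fan out across all shards and be merged in the application.

```python
# Toy sketch of hash-based shard routing; everything here is made up for
# illustration and is nowhere near production-ready.
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]  # hypothetical DSNs

def shard_for(user_id):
    """Map a shard key (here: user_id) to one of the shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# A query keyed by user_id touches exactly one shard...
print(shard_for(42))

# ...but anything not keyed by user_id (reports, cross-user joins, etc.)
# has to hit every shard and be merged by your application code.
def all_shards():
    return list(SHARDS)
```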

(Disclaimer: I work at a hosting provider that does a lot of hybrid VPS/physical setups on behalf of customers of various sizes, and I'm sure this colours my judgment on the subject)

womble

Let me add to Womble's answer - as someone who just finished working on a project with a non-trivial database around 21,000 GB in size, you have two fundamental issues you need to understand:

  • RAM is relative. A modern server for a proper database has 256 GB of RAM or more. A VPS does not even show up as a "real database server" in that world.

  • Disc speed is also relative. I run a system at home you would likely consider extremely powerful - 2 SSDs, plus 8 VelociRaptors just for data, to get a proper IO budget - but in my world that does not even show up. The last system I worked on had 3 storage nodes, each with 768 GB of flash memory to BUFFER IO, and it delivered more data in random IO than you get from your discs sequentially.

Basically, you can add a lot more RAM than you think, and at some point you sit down and design a database SERVER that is IO optimized. Interestingly enough, one item missing today - where everyone thinks virtualization solves all issues and brings world peace - is that database servers ARE IO bound, and this is, for the most part, a solved problem: just expect to need large cases with tons of drives, or actual SSDs these days. Nothing comes free, but this is an unavoidable, fundamental problem, and it is a solved one. It is one reason you can get nice 4U racks from SuperMicro that host 72 slots for discs. It is one reason SAS was designed. It is one reason SSDs are very much liked for databases - they are about 100 times as fast (or more) than hard discs when talking IO per second.
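To put rough numbers on that last claim (ballpark figures only - real throughput depends heavily on the drive, queue depth and block size):

```python
# Back-of-envelope arithmetic with ballpark figures; real numbers vary widely.
random_reads = 1_000_000   # e.g. an index-miss-heavy workload

hdd_iops = 150             # rough figure for a single 7,200 rpm disc
ssd_iops = 20_000          # rough figure for a single SATA SSD

print("HDD: {:.0f} s".format(random_reads / hdd_iops))   # ~6,700 seconds
print("SSD: {:.0f} s".format(random_reads / ssd_iops))   # ~50 seconds
```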

VPS just do not go there ;)

Does the DBMS change its caching techniques to avoid disk accesses whenever possible?

No, it does not, because this is the ONLY (!) sensible caching technique TO START WITH. Any proper database in the larger world (SQL Server, DB2, Oracle) tries to use memory to avoid IO as much as possible. Read SQL blogs and you'll find plenty of not-too-experienced people complaining that SQL Server uses too much memory - sure, because the memory is there and it tries to cache as much as possible.
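To make that concrete, a buffer pool boils down to something like this toy LRU page cache - keep as many pages in RAM as will fit and evict the least recently used one on overflow. Real engines are vastly more sophisticated, but the principle is the same:

```python
# Toy LRU page cache, a crude stand-in for a database buffer pool.
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity_pages, read_page_from_disk):
        self.capacity = capacity_pages
        self.read_page_from_disk = read_page_from_disk   # slow path
        self.pages = OrderedDict()                        # page_id -> page bytes

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # cache hit: mark as recently used
            return self.pages[page_id]
        page = self.read_page_from_disk(page_id)   # cache miss: hit the disc
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page
        return page
```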

This is also one reason databases use transaction logs - changes to the database do not have to be written out NOW; the writes can be delayed, while the updates are kept in the tx log and are thus safe in case of a crash.
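In heavily simplified form, the trick looks something like this (a sketch only - real engines batch, checkpoint and recover far more carefully):

```python
# Heavily simplified write-ahead-log sketch: the change hits the sequential,
# cheap-to-append log and is fsync'd before the commit is acknowledged; the
# actual data pages can be flushed to disc lazily, later.
import json
import os

def commit(change, log_path="tx.log"):
    with open(log_path, "a") as log:
        log.write(json.dumps(change) + "\n")
        log.flush()
        os.fsync(log.fileno())   # the change is now durable
    # ...the in-memory data pages are updated here and written back whenever
    # it's convenient; after a crash, replaying tx.log reconstructs them.

commit({"table": "accounts", "id": 42, "balance": 100})
```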

Again, this is a "solved problem". Oracle has hardware that goes there - our 21,000 GB setup used an Oracle Exadata, and it was THE SMALLEST SETUP THEY SELL.

TomTom
  • VPS just do not go there? VPS can't use SSD or what? – BigSack Aug 20 '13 at 04:44
  • Depends. If you go "cheapo VPS" then no, they won't. Database servers are normally bad candidates for virtualization. My main one has 32 GB RAM at the moment, using 8 discs - 2 of them SSDs - just for the databases (no OS yet). And it is growing - likely hitting 24 discs before the end of the year. It is a VPS now - on my own hardware - but it will soon be a dedicated box. VPS do not traditionally scale up that far. – TomTom Aug 20 '13 at 04:50

Another option that has not been mentioned is database-as-a-service. If the issue is that a single DB instance is running out of RAM, consider a database service that supports auto-scaling of throughput. This type of service automatically scales the database across multiple nodes - beyond the RAM limit of even the largest single machine - and in this way accommodates additional throughput or connections. I am aware of two services that state they provide auto scaling: Xeround (MySQL) and Enterprise DB (PostgreSQL).

Lena Weber