11

Question: What are good strategies for achieving 0 (or as close as possible to 0) downtime when using Django?

Most of the answers I read say "use South" or "use Fabric", but those are very vague answers IMHO. I actually use both, and am still wondering how to get as close to zero downtime as possible.

Some details:

I have a decently sized Django application that I host on EC2. I use South for schema and data migrations, as well as Fabric with boto for automating repetitive deployment/backup tasks that get triggered through a set of Jenkins (continuous integration server) tasks. The database I use is a standard PostgreSQL 9.0 instance.

I have a...

  1. staging server that gets constantly edited by our team with all the new content and gets loaded with the latest and greatest code, and a...

  2. live server that keeps changing with user accounts and user data - all recorded in PostgreSQL.

Current deployment strategy:

When deploying new code and content, EC2 snapshots of both servers (live and staging) are created. The live server is switched to an "Updating new content" page...

Downtime begins.

The live-clone server gets migrated to the same schema version as the staging server (using South). A dump of only the tables and sequences that I want preserved from live gets created (particularly, the user accounts along with their data). Once this is done, the dump gets uploaded to the staging-clone server. The tables that were preserved from live are truncated and the data gets inserted. As the data in my live server grows, this time obviously keeps increasing.

Once the load is complete, the Elastic IP of the live server gets reassigned to the staging-clone (which is thus promoted to be the new live). The live instance and the live-clone instance get terminated.

Downtime ends.

Yes, this works, but as data grows, my "virtual" zero downtime gets further and further away. Of course, something that has crossed my mind is to somehow leverage replication and to start looking into PostgreSQL replication and "eventually consistent" approaches. I know there is some magic I could do perhaps with load balancers, but the issue of accounts created in the meantime makes it tricky.

What would you recommend I look at?

Update:

I have a typical single-node Django application. I was hoping for a solution that would go more in depth on Django-specific issues. For example, the idea of using Django's support for multiple databases with custom routers alongside replication has crossed my mind. There are issues related to that which I hope answers would touch upon.
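To make that concrete, this is roughly the kind of router I have in mind (just a sketch; the "replica" alias and the module path are hypothetical, and I have not tried this in production):

```python
# settings.py (hypothetical):
#   DATABASES = {"default": {...primary...}, "replica": {...streaming replica...}}
#   DATABASE_ROUTERS = ["myproject.routers.ReplicaRouter"]


class ReplicaRouter(object):
    """Send reads to the replica, writes to the primary."""

    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same schema, so relations between objects are fine.
        return True
```

The part I am unsure about is what happens to rows written in the meantime (new accounts, sessions) during the switchover, given replication lag.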

rburhum

4 Answers

4

What might be interesting to look at is a technique called canary releasing. I saw a great presentation by Jez Humble last year at a software conference in Amsterdam; it was about low-risk releases, and the slides are here.

The idea is not to switch all systems at once, but to send a small set of users to the new version first. Only when all performance metrics of the new system are as expected are the others switched over as well. I know that this technique is also used by big sites like Facebook.
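At the application level, the routing can be as simple as a deterministic feature flag. Here is a hypothetical sketch (the view names and the percentage are made up; I have not used this in production):

```python
import hashlib

from django.http import HttpResponse

CANARY_PERCENT = 5  # hypothetical share of users sent to the new code path


def in_canary(user_id, percent=CANARY_PERCENT):
    """Deterministically place roughly `percent`% of users in the canary group."""
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return bucket < percent


def checkout(request):
    # A small slice of users exercises the new implementation; everyone else
    # stays on the proven one. Metrics decide when to raise the percentage.
    if request.user.is_authenticated() and in_canary(request.user.id):
        return HttpResponse("new checkout flow")   # stand-in for the new view
    return HttpResponse("old checkout flow")       # stand-in for the current view
```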

Wesley
  • Thanks Wesley. A canary release sounds very much like what is described in the Lean Startup book. In my case, I only have two nodes (a live and a staging server), so it doesn't make sense for me to revert back. Even if I did, I would have to think about doing reverse schema/data migrations, which are not trivial in certain cases. I will look at the links and thank you for them! – rburhum May 23 '12 at 07:06
  • I never used canary releases myself, but I think the idea is to do only database migrations that add tables/columns and never alter/delete anything. That way, all versions can run on the same, latest database schema. I hope the slides give you some new inspiration! – Wesley May 23 '12 at 07:13
  • I went over the slides, and they are very good. Thank you for them. Nevertheless, they don't address any of the issues that are Django specific. For example, I have users (django-auth) with a whole bunch of related data. Then I have other models. All in the same single node... canary releases don't apply to me unless I start load balancing between multiple nodes, which I guess I could do but I would love more information about django specific patterns. Regardless, thank you for the link! – rburhum May 23 '12 at 16:29
2

The live server should not get migrated. That server should be accessible from two staging servers, server0 and server1. Initially, server0 is live and changes are made to server1. When you want to change software, you switch which server is live.

As to new content, that should not be on the staging server; it should be on the live server. Add a column with a version number to your content tables, and modify your code base to select the correct version number of the content. Develop software to copy old versions into new rows with updated version numbers as needed.

Put the current version number in your settings.py on server0 and server1, so you have a central place for the software to refer to when selecting data, or create a database access app that can be updated to return the correct versions of content. Template files, of course, can live on each server and will be appropriate to that server's code.
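A rough sketch of what I mean (the model, manager, and setting names are made up, and this is untested):

```python
# settings.py would carry something like CONTENT_VERSION = 7, bumped on each release.
from django.conf import settings
from django.db import models


class ContentManager(models.Manager):
    def current(self):
        # Single access point for content queries: bumping CONTENT_VERSION in
        # settings.py switches every reader to the new content in one place.
        return self.filter(version=settings.CONTENT_VERSION)


class Article(models.Model):
    version = models.IntegerField(db_index=True)
    title = models.CharField(max_length=200)
    body = models.TextField()

    objects = ContentManager()

# In views you would call Article.objects.current() instead of Article.objects.all().
```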

This approach will eliminate any downtime. You will have to rewrite some of your software, but if you find a common access method, such as a database access method that you can modify, you may find it is not that much work. The up-front investment in creating a system that specifically supports instant switching of systems will be much less work in the long term, and will scale to any content size.

kd4ttc
1

If I understand correctly, the problem seems to be that your application is down while the data are being restored to a new database along with the schema.

Why do you create a new server in the first place? Why not migrate the database in place (of course, after you have extensively tested the migrations) and, once this is done, update the code and "restart" your processes? Gunicorn, for instance, can accept the HUP signal, which will make it reload the application without dropping connections in the queue.
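For example, reloading gunicorn after the code update boils down to sending the master process a HUP (the pidfile path is an assumption about your setup):

```python
# Send SIGHUP to the gunicorn master: it re-reads its configuration, starts
# fresh workers and gracefully shuts down the old ones.
import os
import signal

with open("/var/run/gunicorn.pid") as f:   # assumed pidfile location
    master_pid = int(f.read().strip())

os.kill(master_pid, signal.SIGHUP)
```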

Many migrations will not have to lock the database tables at all, so this is safe. For the rest, there are other ways to do it. For instance, if you want to add a new column that has to be populated with correct data first, you can do it in the following steps (briefly described, with a sketch after the list):

  1. Add the column as accepting NULL values and make Django start writing to that column, so that new entries will have the correct data.
  2. Populate the existing entries.
  3. Make Django start reading from the new column, too.
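Here is what those steps could look like in practice ("Article" and "summary" are made-up names):

```python
# Step 1: declare the new column as nullable, so the ALTER TABLE is cheap and
# code that never touches the column keeps working.
from django.db import models


class Article(models.Model):
    body = models.TextField()
    summary = models.TextField(null=True, blank=True)  # the new column


# Step 2: backfill existing rows in the background while the site stays up,
# e.g. from a one-off management command.
def backfill_summaries():
    for article in Article.objects.filter(summary__isnull=True).iterator():
        article.summary = article.body[:200]  # whatever "correct data" means here
        article.save()
```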
mpessas
  • The node that has all the new content (the staging server) also has **a whole bunch** of new and updated media files. To solve the media file problem I could just rsync (without deleting old files), finish deployment, then rsync with delete to remove unnecessary files (it is a bit more complicated than that because I deploy with CloudFront, but in theory this should work). The problem is that if I truncate and load the new set of content on the live server, people will see errors if they try to use the site while the new content is loading. – rburhum May 23 '12 at 16:38
  • If I had a simple Django model with database stamps, I could do a custom low-budget replication sync of that model. But I have 30 models with complex relationships among them. I could roll out a new version of all my models with versioning attached to them, but this is far from trivial and "there's gotta be an easier way"(tm). – rburhum May 23 '12 at 16:42
  • Do you mean static files (in Django, media files are those uploaded by the user)? There is django-staticfiles for that (check the CachedFileStorage or something). It will do what you want. But, TBH, I cannot understand what you want to achieve with versioning. – mpessas May 23 '12 at 18:10
  • Syncing static files (for the media folder) is trivial with rsync. In addition, I use a custom storage backend for S3, so that is not the issue. The issue is that the Django models (plus related data) are different in the two versions of the webapp (the live one vs. the one that is about to be deployed). Deployment of code and content, while they are being synced, causes downtime. Patterns for reducing that downtime are the question. – rburhum May 23 '12 at 19:04
0

To achieve zero downtime you must have at least 2 servers plus a balancer, and update them sequentially. If you want to update both the database and the application and still have zero downtime, you must have 2 database servers as well. No miracles, no silver bullet, and Django will not get you out of deployment issues.

Nikolay Fominyh
  • which of the nodes has the user accounts? – rburhum Jun 01 '12 at 16:23
  • Both application nodes have access to user accounts. Or what do you mean? – Nikolay Fominyh Jun 02 '12 at 15:03
  • "have access to user accounts" do you mean there is a third node with a database by itself? So your solution actually requires three nodes. If so, that is OK. But what happens when one of the aplication nodes needs a south data migration to be applied to run? – rburhum Jun 02 '12 at 22:08
  • Yes, my solutions always requires at least one dedicated database server. IF south migration takes a lot of time to run(for example you have update on table with 10M rows in it) - then you have to replicate database on at least 2 nodes. By the way, I don't have enough experience here. I know, that instagram used EC2 snapshots for this purpose, and for postgresql traditional way - is to use solution like slony. – Nikolay Fominyh Jun 03 '12 at 21:05
  • But then there is no point of having two EC2 application instances if only one can run at a time with a particular (south) schema version, no? I do agree with you that the solution is to have two instances and do a switch, but my original question is still not being addressed (e.g how to deal with django-auth objects in this scenario). – rburhum Jun 03 '12 at 21:29
  • What problem with django-auth objects? Where you store session? – Nikolay Fominyh Jun 04 '12 at 06:33
  • an account gets created. where do you store it? – rburhum Jun 08 '12 at 02:55
  • When we creating user account - we store it in db. – Nikolay Fominyh Jun 12 '12 at 08:51
  • so one is a master, the other one is a slave. when you update the master with a migration that takes a long time, the slave can only provide read access, but not write functionality. The same is true for all other objects that require writing. For any non-trivial app, that may equate to down time (the web app needs certain objects in the db to writable in order not to be down) – rburhum Jun 12 '12 at 14:03
  • if migration takes a lot of time - then it must run separately. find way to hot swap master on running application. – Nikolay Fominyh Jun 12 '12 at 15:10