I am writing a web app that runs on AWS. My app requires users to upload their PDF files, which I convert into images using the "convert" utility on Linux. Here is my setup on Ubuntu 12.04:

  • Django
  • Celery
  • Django Celery
  • Boto

I am using Apache as my web server.

The workflow is as follows: there are three asynchronous tasks, two queues for handling all the processing, and S3 for storing input and output files. A user uploads a PDF, then:

  1. accept_file_task is called: this task takes the user-uploaded PDF, stores it in my S3 storage, and then inserts a message into the input_queue (SQS).

  2. check_queue_and_launch_instance_task: a periodic task that monitors the number of messages in the input_queue and launches instances whenever the queue has more messages than the number of EC2 instances.

  3. The instances have a bootstrap script, which is a while True: loop. Any instance can pick a message from the input_queue, run subprocess.Popen("convert "+input+output), write the processed state to the output_queue, upload the generated image to the S3 output bucket, and make it available as a download link.

  4. output_process_task: another periodic task that polls the output_queue and, whenever a message is available, updates the status in the table mentioned below.
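
The conversion call in step 3 could be sketched roughly as below. This is only a minimal sketch, not the actual bootstrap script: `build_convert_command` and `convert_pdf` are hypothetical names, and it assumes ImageMagick's `convert` is installed and on the PATH. Passing the arguments as a list sidesteps the spacing and quoting mistakes that are easy to make when concatenating a command string.

```python
import subprocess


def build_convert_command(input_path, output_path):
    """Return the argument list for `convert` (hypothetical helper)."""
    # A list keeps each path as a single argument, even if it contains spaces.
    return ["convert", input_path, output_path]


def convert_pdf(input_path, output_path):
    """Run the conversion and return True if `convert` exited with status 0."""
    # Assumption: `convert` is on the PATH. No shell is involved because the
    # command is a list, so file names are never interpreted by a shell.
    return subprocess.call(build_convert_command(input_path, output_path)) == 0
```

The worker loop would call `convert_pdf` once per message and only write a success state to the output_queue when it returns True.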

I am using a model called Document to store all the status information. Users also register, so there is a table to store user information, and Celery created a lot of tables to store its task information. Right now I am using a single instance and the sqlite3 database (that comes with Python) on that instance.

I am unsure about the following things:

  1. How do I scale up the database? Should I go for RDS, SimpleDB, or AmazonDB? If not for Celery, I could easily have used SimpleDB. I am really stuck on this one.

  2. How do I get rid of the two periodic tasks, check_queue_and_launch_instance_task and output_process_task? My idea is that Auto Scaling must be used in some way, so that if needed at a later stage an Elastic Load Balancer can be used.

If any of you have designed something similar, please help me with how to go about it.

Zulu
rajatk

1 Answer

> How do I scale up the database? Should I go for RDS, SimpleDB, or AmazonDB? If not for Celery, I could easily have used SimpleDB. I am really stuck on this one.

Keep in mind that premature optimization is the root of all evil. The question of RDS (which is really just MySQL, Oracle, or MS SQL) vs. SimpleDB is more of an application design decision than one based on scalability. SimpleDB is just a simple key-value store. RDS, on the other hand, will give you full ACID functionality. If your data is relational, then you should be using a relational database. If you just need a place to store simple strings or integers, then something like SimpleDB would make more sense.

> Right now I am using a single instance and the sqlite3 database (that comes with Python) on that instance.

Make sure that you understand the consequences of a) creating a single point of failure in your design and b) SQLite's limitations compared to a standalone RDBMS in this application. You can use it, but it is really intended for single-user applications: it allows only one writer at a time, so concurrent web requests and Celery workers will contend for the same database lock.
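
If you do move to a standalone database such as MySQL on RDS, the Django side is mostly a settings change. The fragment below is a minimal sketch: the environment variable names and default values are placeholders I made up for illustration, not anything from your setup.

```python
import os

# Hypothetical settings.py fragment. Every value here is a placeholder read
# from the environment; substitute the endpoint and credentials of your own
# database instance.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": os.environ.get("DB_NAME", "documents"),
        "USER": os.environ.get("DB_USER", "app"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", "localhost"),
        "PORT": os.environ.get("DB_PORT", "3306"),
    }
}
```

Reading the values from the environment keeps credentials out of the code and lets every instance you launch point at the same database.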

> How do I get rid of the two periodic tasks, check_queue_and_launch_instance_task and output_process_task? My idea is that Auto Scaling must be used in some way, so that if needed at a later stage an Elastic Load Balancer can be used.

If you're willing to replace Celery with SQS, you can tie together SQS + SNS + CloudWatch to simplify this portion of your app. That said, what you're doing doesn't sound like a bad choice, especially if it's working well already. Your time is probably better spent on the problems in front of you than on those that might occur down the road.
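
As a rough illustration of the CloudWatch piece, an alarm on the queue's backlog can trigger a scaling policy instead of a periodic task counting messages. The sketch below only builds the alarm parameters; `build_queue_depth_alarm`, the alarm name, the threshold, and the period are all assumptions chosen for illustration, and the scaling policy ARN would come from your own Auto Scaling group.

```python
def build_queue_depth_alarm(queue_name, scaling_policy_arn, threshold=10):
    """Build keyword arguments for a CloudWatch alarm on SQS queue depth.

    Hypothetical helper: the naming scheme, threshold and period are
    illustrative choices, not recommended values.
    """
    return {
        "AlarmName": "%s-backlog" % queue_name,
        "Namespace": "AWS/SQS",
        # Number of messages currently available for retrieval from the queue.
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Average",
        "Period": 300,  # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # The alarm fires this action (e.g. a scale-out policy) when breached.
        "AlarmActions": [scaling_policy_arn],
    }
```

Assuming boto3 rather than the older Boto library mentioned in the question, the resulting dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.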

jamieb
  • Thanks for the reply. I was under the impression that Celery is needed for the asynchronous tasks. Can you please point me to a reference showing how SQS + SNS + CloudWatch can replace Celery? I could literally use SimpleDB, since my database requirements are not all that relational. It saves me from all the snapshot creation etc., whereas with RDS I might need to do them. – rajatk Jan 25 '13 at 16:22
  • You can actually use [Celery on top of SQS](http://docs.celeryproject.org/en/latest/getting-started/brokers/sqs.html). Does that help you at all? – jamieb Jan 25 '13 at 18:03
  • That's what I am already doing. But with periodic tasks, I am worried about autoscaling. – rajatk Jan 25 '13 at 18:24