I have a service running on a local server, written using the Python threading library. Think of it as a kind of web crawler. It uses 50 threads. I want to deploy it on the Amazon Web Services cloud and scale it up so that it uses more threads.
Simply put, I have two queues: Qinput with URLs and Qoutput with page content. The threads pick URLs from Qinput, fetch the content of the web page, and put it into Qoutput.
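For reference, here is a minimal sketch of the pattern I'm describing (the fetching via urllib.request, the None sentinel, and the worker function are placeholders, not my actual code):

    import threading
    import queue
    import urllib.request

    NUM_THREADS = 50  # currently 50; the question is how far this can be raised

    Qinput = queue.Queue()   # URLs to crawl
    Qoutput = queue.Queue()  # fetched page content

    def worker():
        # Each thread repeatedly takes a URL, fetches it, and stores the result.
        while True:
            url = Qinput.get()
            if url is None:          # sentinel: no more work
                Qinput.task_done()
                break
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    Qoutput.put((url, resp.read()))
            except Exception as exc:
                Qoutput.put((url, exc))
            finally:
                Qinput.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()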
Question: is it enough to simply increase the number of threads to, say, 500, 5,000 or 50,000, and AWS + Python will handle it? Should I expect the service to run seamlessly, or are there some "standard" design pitfalls that I should be aware of when porting a multithreaded service to AWS?
I am aware of the Global Interpreter Lock, although it should not be an issue here, as the main task of the threads is to make calls outside the interpreter while crawling / scraping pages.