I am trying to continuously crawl a large amount of information from a site using the REST API they provide. I have the following constraints:
- Stay within the API limit (5 calls/sec)
- Utilise the full limit (make exactly 5 calls per second, i.e. 5*60 = 300 calls per minute); see the pacing sketch after this list
- Each call will use different parameters (fetched from a DB or an in-memory cache)
- Calls will be made from AWS EC2 (or GAE) and processed data will be stored in AWS RDS/DynamoDB
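For the second constraint, this is roughly the pacing I have in mind (the URL and the use of `requests` are placeholders, not my actual client code):

```python
import time
import requests  # assuming plain HTTP GETs; my real client code differs

RATE = 5               # API limit: 5 calls per second
INTERVAL = 1.0 / RATE  # 0.2 s between calls

def paced_calls(param_sets):
    """Make at most one request per 0.2 s slot so the 5 calls/sec limit is never exceeded."""
    for params in param_sets:
        started = time.monotonic()
        resp = requests.get("https://api.example.com/items", params=params)  # placeholder URL
        yield resp
        # sleep off whatever is left of this call's time slot
        remaining = INTERVAL - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
```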
For now I am just using a scheduled task that runs a Python script every minute; the script makes 10-20 API calls, processes the responses, and stores the data in the DB. I want to scale this procedure up (to 5*60 = 300 calls per minute) and make it manageable through code (pushing new tasks, pausing/resuming them easily, monitoring failures, changing call frequency).
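For reference, the current script is essentially the following (`fetch_params`, `call_api`, `process` and `save` are simplified stand-ins for my real DB/cache and HTTP code):

```python
def run_once():
    """Invoked once a minute by the scheduler; makes a small batch of API calls."""
    for params in fetch_params(limit=20):  # hypothetical helper: next batch of params from DB/cache
        resp = call_api(params)            # hypothetical helper: one REST call
        save(process(resp))                # hypothetical helpers: parse and store in RDS/DynamoDB

if __name__ == "__main__":
    run_once()
```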
My question is: what are the best available tools to achieve this? Any suggestion, guidance, or link is appreciated.
I know the names of some task-queuing frameworks like Celery/RabbitMQ/Redis, but I do not know much about them. However, I am willing to learn one or all of them if they are the best tools for this problem; I just want to hear from SO veterans before jumping in ☺
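From skimming the Celery docs, I think a rate-limited task would look roughly like the sketch below (the broker URL and the `call_api`/`process`/`save` helpers are placeholders; also, as far as I understand, `rate_limit` is enforced per worker rather than cluster-wide, so I'm not sure it alone guarantees exactly 5 calls/sec overall):

```python
from celery import Celery

# placeholder broker URL; Redis or RabbitMQ should both work as the broker
app = Celery("crawler", broker="redis://localhost:6379/0")

@app.task(rate_limit="5/s")  # throttle how fast each worker consumes this task
def crawl(params):
    resp = call_api(params)   # hypothetical helper wrapping the REST call
    save(process(resp))       # hypothetical helpers for parsing and storing the result
```

The producer side would then, if I understand correctly, just enqueue `crawl.delay(params)` for every parameter set pulled from the DB/cache.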
Also, please let me know if there is any other AWS service I should look at using (SQS or AWS Data Pipeline?) to make any step easier.
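If SQS turns out to be a good fit for feeding parameters to workers, I imagine the consuming side would look roughly like this (queue URL and region are placeholders; I have only skimmed the boto3 docs):

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # example region
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-params"  # placeholder

def poll_params():
    """Long-poll the queue for up to 10 parameter messages at a time."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling so empty reads don't spin
    )
    for msg in resp.get("Messages", []):
        yield msg["Body"]
        # delete only after the message has been handed off for processing
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```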