I am trying to continuously crawl a large amount of information from a site using the REST API it provides. I have the following constraints:

  1. Stay within the API limit (5 calls/sec)
  2. Utilise the full limit (make exactly 5 calls per second, i.e. 5*60 calls per minute)
  3. Each call will use different parameters (params will be fetched from a DB or an in-memory cache)
  4. Calls will be made from AWS EC2 (or GAE), and the processed data will be stored in AWS RDS/DynamoDB

For now I am just using a scheduled task that runs a Python script every minute; the script makes 10-20 API calls, processes the responses, and stores the data in the DB. I want to scale this procedure (to 5*60 = 300 calls per minute) and make it manageable via code (pushing new tasks, pausing/resuming them easily, monitoring failures, changing the call frequency).

My question is: what are the best available tools to achieve this? Any suggestions, guidance, or links are appreciated.

I know the names of some task-queuing tools like Celery/RabbitMQ/Redis, but I do not know much about them. However, I am willing to learn one or all of them if they are the best tools for my problem; I want to hear from SO veterans before jumping in ☺
Also, please let me know if there is any other AWS service I should consider (SQS or AWS Data Pipeline?) to make any step easier.

AsifM

1 Answer

You needn't add an external dependency just for rate-limiting, as your use case is rather straightforward.

I can think of two options:

  • Modify the script (that currently wakes up every minute and makes 10-20 API calls) to wake up every second and make 5 calls (sequentially or in parallel).
    • In your current design, your API calls might not be properly distributed across 1 minute, i.e. you might be making all your 10-20 calls in the first, say, 20 seconds.
    • If you change that script to run every second, your API call rate will be more balanced.
  • Change your Python script to a long-running daemon, and use a rate-limiter library, such as this. You can configure the latter to make 1 call per x seconds (x = 0.2 for 5 calls per second).
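
As a rough illustration of the second option, here is a minimal sketch of a long-running loop that spaces calls at a fixed interval using only the standard library. It does not use any particular rate-limiter library: `IntervalRateLimiter`, `run_calls`, and the `call_api` callback are hypothetical names introduced for this example, and the real API call, parameter source, and DB write are left to you.

```python
import time


class IntervalRateLimiter:
    """Hypothetical helper: allow at most one call every `interval` seconds."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_slot = clock()  # earliest time the next call may start

    def wait(self):
        """Block until the next call slot is available."""
        delay = self._next_slot - self._clock()
        if delay > 0:
            self._sleep(delay)
        # Schedule relative to the previous slot to avoid drift, but never
        # in the past, so a long pause does not cause a burst of calls.
        self._next_slot = max(self._next_slot, self._clock()) + self.interval


def run_calls(params_iter, call_api, limiter):
    """Make one rate-limited call per parameter set.

    `call_api` is a placeholder for whatever makes the request,
    processes the response, and stores the result.
    """
    for params in params_iter:
        limiter.wait()
        call_api(params)
```

With `IntervalRateLimiter(0.2)` this issues at most 5 calls per second. Because the calls here are sequential, the full rate is only reached if each call finishes in under 0.2 seconds; slower calls would need to be dispatched in parallel (for example to a thread pool) after `limiter.wait()` returns.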
ketan vijayvargiya