I'm building a distributed crawling mechanism and want to make sure that no more than 30 requests are made to the server in one minute. Each enqueued task makes one request.

All tasks are enqueued in Redis and dequeued using the API provided by python-rq.

The approach is to set a key in Redis that expires every minute and holds the number of requests sent.

Each time a piece of work is available, check whether requests sent < 30:

- If no, then just sleep for a minute
- If yes, then work


The following is my custom worker:

#!/usr/bin/env python
import sys
import time
from rq import Connection, Worker
from redis import Redis

redis = Redis()

def should_i_work():
    r = redis.get('app:requests_sent_in_last_minute')
    if r == None:
        redis.setex('app:requests_sent_in_last_minute', 1, 60)
    return  r == None or int(r) < 30

def increment_requests():
    r = int(redis.get('app:requests_sent_in_last_minute'))
    redis.set('app:requests_sent_in_last_minute', r+1)

def main(qs):
    with Connection():
        try:
            while True:
                if should_i_work():
                    increment_requests()
                    w = Worker(qs)
                    w.work()
                else:
                    time.sleep(60)
        except KeyboardInterrupt:
            pass

if __name__ == '__main__':
    qs = sys.argv[1:] or ['default']
    main(qs)

This doesn't seem to work: the worker performs tasks at its usual speed regardless of the count, and the value of the key is never updated beyond 3.

I have a strong feeling that my thought process is flawed. What am I doing wrong here?

Thanks

Shivek Khurana

1 Answer

After reviewing the worker.py source, the mistake in my thought process was evident. The w.work() function starts its own loop and continuously dequeues tasks, so the rate check in my outer loop runs once and control never returns to it while the worker is running.

Because that loop cannot be controlled without rewriting the worker class, the next best way is to control the enqueuing process: don't enqueue a task if 30 have already been added in the last minute.
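To illustrate the enqueue-side idea, here is a minimal sketch of a fixed-window counter. It is not the code from the gist linked below. The helper names `try_acquire` and `enqueue_rate_limited` are hypothetical. It assumes `client` is a `redis.Redis` instance (only `incr` and `expire` are used) and `queue` is an `rq.Queue`.

```python
import time

KEY = "app:requests_sent_in_last_minute"  # key name taken from the question
LIMIT = 30   # maximum tasks per window
WINDOW = 60  # window length in seconds


def try_acquire(client, key=KEY, limit=LIMIT, window=WINDOW):
    """Return True if another task may be enqueued in the current window.

    `client` is assumed to behave like redis.Redis (incr/expire).
    """
    # INCR is atomic and creates the key with value 1 if it is missing,
    # so concurrent producers cannot lose updates as GET + SET can.
    count = client.incr(key)
    if count == 1:
        # First hit in this window: start the expiry clock.
        client.expire(key, window)
    return count <= limit


def enqueue_rate_limited(queue, client, func, *args, poll=1.0, sleep=time.sleep):
    """Wait until the window has room, then enqueue func(*args).

    `queue` is assumed to behave like rq.Queue (enqueue).
    """
    while not try_acquire(client):
        sleep(poll)  # window is full; retry once the key may have expired
    return queue.enqueue(func, *args)
```

Two caveats apply to this sketch. If the process dies between `incr` and `expire`, the key never expires and the loop would wait forever. A fixed window can also let through a burst of up to twice the limit across a window boundary. Neither matters for a rough crawler throttle, but both would need handling for a strict limit.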

Here's the solution I came up with: https://gist.github.com/shivekkhurana/7201e5cd2ec9d51af31c8b96eeb8fcf7

Just pass RequestAwareWorker via the -w flag of the rq worker command.

Shivek Khurana