
So I have a Docker image that runs a Celery worker via supervisor, and it works just fine on single-container Docker Elastic Beanstalk (pretty long tasks, so acks_late = true, concurrency = 1 and prefetch_multiplier = 1).

The trouble is that I would like to scale instances depending on the effective task load of the workers, while EB only allows scaling on overall network and CPU load.

Adding a rule to scale up on CPU load works OK, but I have no guarantee that EB won't decide to scale down while in the middle of a task. That would trigger a docker stop and effectively kill any running Celery task that cannot complete fairly quickly (within 10 s, if I'm not mistaken).

Ideally I would need a monitor based on both CPU activity and tasks in the queue; in pseudocode, something like:

while check interval has passed:
    if task queue is empty (or workers are not busy):
        if running instances is greater than 1:
            scale down 1 instance
    else if CPU load is higher than threshold:
        scale up 1 instance

Now, the problem is that this level of logic doesn't seem achievable in EB and looks more likely to run on ECS, but I am not sure about the following:

  • what kind of Celery monitor should we implement, and where should the code run? The usual Celery worker monitoring commands don't seem to be a good solution for checking how busy a worker is, and we need to handle the additional complexity of running the worker inside Docker
  • where should the cluster scaling function run? After a chat with an AWS engineer, it seems that AWS Lambda could be a potential solution, but hooking cluster instance load reports up to Lambda snippets seems quite complicated and hard to maintain
  • as a bonus question, in case we need to migrate to ECS, we will also need to rewrite our deploy scripts to manually trigger a version swap, as this is currently managed by EB. What would be the best way to do so?

any help appreciated, thanks!

gru

1 Answer


You can take advantage of the AWS Elastic Beanstalk service, where you just need to provide a Docker image. It also comes with a dashboard where you can set your environment variables or scale your application/worker based on CPU, requests, memory, etc.

You have already made a Docker image of your Celery worker. So, instead of scaling on CPU or memory, you can scale your instances based on the number of tasks in the queue, and you can set the scaling constraints yourself.

The following are different ways to count the Celery tasks inside the queue. Pick the option that suits you to keep a watch on the task queue.

Using pika:

import pika

# Connect to the broker (adjust host, port and credentials to your RabbitMQ setup).
pika_conn_params = pika.ConnectionParameters(
    host='localhost', port=5672,
    credentials=pika.credentials.PlainCredentials('guest', 'guest'),
)
connection = pika.BlockingConnection(pika_conn_params)
channel = connection.channel()
# Re-declaring the queue returns its current stats; the arguments must match the
# existing queue's settings (or pass passive=True to only inspect it).
queue = channel.queue_declare(
    queue="your_queue", durable=True,
    exclusive=False, auto_delete=False
)

print(queue.method.message_count)
connection.close()

Using PyRabbit:

from pyrabbit.api import Client

# PyRabbit talks to the RabbitMQ management API (55672 is the pre-3.0 management
# port; on RabbitMQ 3.0+ it is 15672, as in the HTTP example below).
cl = Client('localhost:55672', 'guest', 'guest')
cl.get_messages('example_vhost', 'example_queue')[0]['message_count']

Using HTTP:

Syntax:

curl -i -u user:password http://localhost:15672/api/queues/vhost/queue

Example:

curl -i -u guest:guest http://localhost:15672/api/queues/%2f/celery 

Note: the default vhost is /, which needs to be escaped as %2f
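
The same management API can also be queried from Python instead of curl; here is a minimal sketch using the requests library (it assumes the management plugin on localhost:15672, guest credentials and a queue named celery, so adjust these to your setup):

import requests

# Ask the RabbitMQ management API for the 'celery' queue on the default vhost ('/' escaped as %2f).
response = requests.get(
    'http://localhost:15672/api/queues/%2f/celery',
    auth=('guest', 'guest'),
)
response.raise_for_status()
queue_info = response.json()
# 'messages' counts ready plus unacknowledged messages currently in the queue.
print(queue_info['messages'])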

Using CLI:

$ sudo rabbitmqctl list_queues | grep 'my_queue'
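
If you go the CLI route from a script, the rabbitmqctl output can also be parsed in Python; a small sketch (my_queue is a placeholder queue name, and rabbitmqctl needs sufficient permissions, hence the sudo in the example above):

import subprocess

def queue_length_cli(queue_name='my_queue'):
    # `rabbitmqctl list_queues` prints one "name  messages" pair per line.
    output = subprocess.check_output(['rabbitmqctl', 'list_queues'], text=True)
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == queue_name:
            return int(parts[1])
    return 0

print(queue_length_cli())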

Now, based on the option you choose (CLI or Python), you can apply the following approach to scale your Celery workers. The common thing between the two is that both need to run continuously.

If you choose the CLI, you can create a script that monitors the number of tasks continuously and applies the scaling logic whenever it exceeds the provided limit. If you are using Kubernetes, it is very easy to scale your Deployments; otherwise, you will need to follow whatever approach your system requires.

With Python, you can follow the same path, but with the advantage that it can run as a service later on if the logic gets more complicated.
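
To make this concrete, here is a rough sketch of such a continuously running monitor in Python. It assumes the queue is inspected with pika as above and that the worker instances live in an EC2 Auto Scaling group whose desired capacity is adjusted with boto3; the group name, queue name, check interval and one-task-per-instance ratio are placeholders for your setup:

import time

import boto3
import pika

ASG_NAME = 'celery-workers'   # placeholder Auto Scaling group name
QUEUE_NAME = 'celery'         # placeholder queue name
CHECK_INTERVAL = 60           # seconds between checks
TASKS_PER_WORKER = 1          # concurrency = 1, so one task per instance

autoscaling = boto3.client('autoscaling')

def queue_length():
    # A passive declare only inspects the queue and returns its current message count.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    try:
        channel = connection.channel()
        queue = channel.queue_declare(queue=QUEUE_NAME, passive=True)
        return queue.method.message_count
    finally:
        connection.close()

def desired_capacity():
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    return groups['AutoScalingGroups'][0]['DesiredCapacity']

while True:
    backlog = queue_length()
    current = desired_capacity()
    # Keep at least one worker, otherwise one instance per TASKS_PER_WORKER queued tasks.
    wanted = max(1, -(-backlog // TASKS_PER_WORKER))
    if wanted != current:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME,
            DesiredCapacity=wanted,
            HonorCooldown=True,
        )
    time.sleep(CHECK_INTERVAL)

Note that scaling in still terminates an instance of AWS's choosing, so to address the original concern about long-running tasks you would also want instance scale-in protection (or to only lower the desired capacity once the workers report themselves idle).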

Aniket patel
  • hey, thanks for this answer, it's great content and I hope it's beneficial to anyone landing on this question. Unfortunately, it arrived a couple of years after I had the problem, so I find it really hard now to say whether it would have addressed the issue or not. – gru Nov 14 '19 at 14:18