
I am running into an issue on an ECS cluster running multiple Celery workers when the cluster needs to scale up.

Some background:

  • I have a task that can potentially run for a few hours.
  • Celery workers on an ECS cluster are currently scaled based on queue depth using Flower. Whenever the queue depth is larger than 1, it scales up a worker to potentially receive more tasks.
  • The broker used is Redis.
  • I have set the worker_prefetch_multiplier to 1, and each worker's concurrency equals 4 (a minimal configuration sketch follows this list).
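
For reference, a minimal sketch of the relevant configuration; the module name and broker URL are placeholders:

from celery import Celery

app = Celery('proj', broker='redis://redis:6379/0')

# each worker process reserves at most 1 extra message on top of what it is executing
app.conf.worker_prefetch_multiplier = 1

# the worker itself is started with a pool of 4 processes:
#   celery -A proj worker -c 4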

The problem definition: Because of these settings, each worker prefetches 4 tasks on top of the 4 it is executing, before the queue depth starts to fill. So with a single worker running, 8 tasks have to be invoked before the 9th brings the queue depth to 1. At that point 4 tasks will be in the STARTED state and 4 tasks will be in the RECEIVED state. When the cluster then scales up to 2 worker nodes, only that 9th task will be sent to the new worker. This means the 4 tasks in the RECEIVED state are "stuck" behind the 4 tasks in the STARTED state for potentially a few hours, which is undesirable.
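
For what it's worth, this is roughly how the split can be observed; the module path is an assumption on my side:

from proj.celery import app

insp = app.control.inspect()
print(insp.active())    # tasks currently executing (STARTED)
print(insp.reserved())  # tasks prefetched but not yet started (RECEIVED)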

Investigated solutions:

  • When searching for a solution, Celery's documentation (https://docs.celeryproject.org/en/stable/userguide/optimizing.html) states that the only way to disable prefetching is to use acks_late=True for the tasks (see the sketch after this list). This indeed prevents tasks from being prefetched, but it causes other problems, like tasks being duplicated on newly scaled worker nodes, which is DEFINITELY not what I want.
  • The -O fair option on the worker is also often suggested as a solution, but seemingly it still leaves tasks in the RECEIVED state.
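
For reference, the acks_late variant I tried looked roughly like this (task and module names are illustrative):

from celery import Celery

app = Celery('proj', broker='redis://redis:6379/0')

@app.task(acks_late=True)
def long_running_task(payload):
    # with late acknowledgement the message is only acked once the task finishes,
    # so nothing is prefetched -- but if a worker is lost mid-task the message is
    # redelivered, which is exactly the duplication described above
    ...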

Currently I can only think of a rather convoluted workaround, so I would be very happy to hear other solutions. The idea is to set the concurrency to -c 2 (instead of -c 4). Then 2 tasks would be started on the first worker node and 2 would be prefetched; all other tasks end up in the queue, triggering a scaling event. Once ECS has scaled up to two worker nodes, I would raise the concurrency of the first worker from 2 to 4, releasing the prefetched tasks. A rough sketch of how that could look follows below.
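
Growing the pool at runtime could be done with Celery's remote control commands; a rough sketch, where the worker name is just a placeholder:

from proj.celery import app

# grow the prefork pool of the first worker from 2 to 4 processes,
# handing the 2 prefetched tasks to the newly added processes
app.control.pool_grow(2, destination=['celery@worker-1'])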

Any ideas/suggestions?

1 Answer


I have found a solution to this problem (in this thread: https://github.com/celery/celery/issues/6500) with the help of @samdoolin. I will provide the full answer here for people who run into the same issue.

Solution: The solution provided by @samdoolin is to monkeypatch can_consume on the consumer's QoS so that a message is only consumed when there are fewer reserved requests than the worker can handle (the worker's concurrency). In my case that means it won't consume requests if 4 requests are already active. Any further request is instead accumulated in the queue, which gives the expected behavior. I can then easily scale the number of ECS containers, each holding a single worker, based on the queue depth.

In practice this would look something like the following (thanks again to @samdoolin):

from celery import Celery
from celery.loaders.app import AppLoader


class SingleTaskLoader(AppLoader):

    def on_worker_init(self):
        # called when the worker starts, before logging setup
        super().on_worker_init()

        """
        STEP 1:
        monkey patch kombu.transport.virtual.base.QoS.can_consume()
        to prefer to run a delegate function,
        instead of the builtin implementation.
        """

        import kombu.transport.virtual

        builtin_can_consume = kombu.transport.virtual.QoS.can_consume

        def can_consume(self):
            """
            monkey patch for kombu.transport.virtual.QoS.can_consume

            if self.delegate_can_consume exists, run it instead
            """
            if delegate := getattr(self, 'delegate_can_consume', False):
                return delegate()
            else:
                return builtin_can_consume(self)

        kombu.transport.virtual.QoS.can_consume = can_consume

        """
        STEP 2:
        add a bootstep to the celery Consumer blueprint
        to supply the delegate function above.
        """

        from celery import bootsteps
        from celery.worker import state as worker_state

        class Set_QoS_Delegate(bootsteps.StartStopStep):

            requires = {'celery.worker.consumer.tasks:Tasks'}

            def start(self, c):

                def can_consume():
                    """
                    delegate for QoS.can_consume

                    only fetch a message from the queue if the worker has
                    no other messages
                    """
                    # note: reserved_requests includes active_requests
                    return len(worker_state.reserved_requests) == 0

                # types...
                # c: celery.worker.consumer.consumer.Consumer
                # c.task_consumer: kombu.messaging.Consumer
                # c.task_consumer.channel: kombu.transport.virtual.Channel
                # c.task_consumer.channel.qos: kombu.transport.virtual.QoS
                c.task_consumer.channel.qos.delegate_can_consume = can_consume

        # add bootstep to Consumer blueprint
        self.app.steps['consumer'].add(Set_QoS_Delegate)


# Create a Celery application as normal with the custom loader and any required **kwargs
celery = Celery(loader=SingleTaskLoader, **kwargs)

Then we start the worker via the following line:

celery -A proj worker -c 4 --prefetch-multiplier -1

Make sure you don't forget the --prefetch-multiplier -1 option, which disables fetching new requests altogether. This makes sure the can_consume monkeypatch is used instead.

Now, when the Celery app is up and you request 6 tasks, 4 will be executed as expected and 2 will remain in the queue instead of being prefetched. This is the desired behavior, without actually having to set acks_late=True.
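
A quick way to check this, assuming an illustrative task name and the default 'celery' queue on Redis:

from proj.tasks import long_running_task

for _ in range(6):
    long_running_task.delay()

# 4 tasks become active on the worker, the other 2 stay on the broker:
#   redis-cli llen celery   -> 2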

One last note: according to Celery's documentation it should also be possible to pass the path to the SingleTaskLoader when starting the worker on the command line, like this:

celery -A proj --loader path.to.SingleTaskLoader worker -c 4 --prefetch-multiplier -1

Unfortunately this did not work for me, but it can be worked around by passing the loader to the constructor instead, as shown above.
