
So I am using RabbitMQ + Celery to create a simple RPC architecture. I have one RabbitMQ message broker and one remote worker which runs the Celery daemon.

There is a third server which exposes a thin RESTful API. When it receives an HTTP request, it sends a task to the remote worker, waits for the response and returns it.
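
On the API side the flow is roughly this (a minimal sketch; proj.sync_celery is the module from my worker command below, while the app attribute, task name and timeout are just placeholders):

from proj.sync_celery import celery_app   # assuming the Celery app is exposed here

def handle_request(payload):
    # Publish a task to sync_queue and block until the worker's reply comes
    # back through the amqp result backend.
    result = celery_app.send_task(
        'proj.tasks.process',              # hypothetical task name
        args=[payload],
        routing_key='sync_task.process',
    )
    return result.get(timeout=10)          # raises if the worker returns an error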

This works great most of the time. However, I have noticed that after a longer period of inactivity (say 5 minutes with no incoming requests), the Celery worker behaves strangely. The first 3 tasks received after such a pause return this error:

exchange.declare: connection closed unexpectedly

After three erroneous tasks it works again. If there are no tasks for a longer period of time, the same thing happens again.

My init script for the Celery worker:

# description "Celery worker using sync broker"

console log

start on runlevel [2345]
stop on runlevel [!2345]

setuid richard
setgid richard

script
chdir /usr/local/myproject/myproject
exec /usr/local/myproject/venv/bin/celery worker -n celery_worker_deamon.%h -A proj.sync_celery -Q sync_queue -l info --autoscale=10,3 --autoreload --purge
end script

respawn

My celery config:

# Synchronous blocking tasks
BROKER_URL_SYNC = 'amqp://guest:guest@localhost:5672//'
# Asynchronous non blocking tasks
BROKER_URL_ASYNC = 'amqp://guest:guest@localhost:5672//'

#: Only add pickle to this list if your broker is secured
#: from unwanted access (see userguide/security.html)
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'UTC'
CELERY_ENABLE_UTC = True
CELERY_BACKEND = 'amqp'

# http://docs.celeryproject.org/en/latest/userguide/tasks.html#disable-rate-limits-if-they-re-not-used
CELERY_DISABLE_RATE_LIMITS = True

# http://docs.celeryproject.org/en/latest/userguide/routing.html
CELERY_DEFAULT_QUEUE = 'sync_queue'
CELERY_DEFAULT_EXCHANGE = "tasks"
CELERY_DEFAULT_EXCHANGE_TYPE = "topic"
CELERY_DEFAULT_ROUTING_KEY = "sync_task.default"
CELERY_QUEUES = {
    'sync_queue': {
        'binding_key':'sync_task.#',
    },
    'async_queue': {
        'binding_key':'async_task.#',
    },
}
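
The routing key of a published task decides which queue it lands in: anything matching sync_task.# goes to sync_queue, anything matching async_task.# goes to async_queue, and a task sent without an explicit routing key uses sync_task.default and therefore also ends up in sync_queue. For example (the task name is just a placeholder):

from proj.sync_celery import celery_app   # assuming the Celery app is exposed here

# No routing key given, so CELERY_DEFAULT_ROUTING_KEY ('sync_task.default')
# applies and the message is bound to sync_queue.
celery_app.send_task('proj.tasks.example')

# Explicit routing key matching 'async_task.#', so this one goes to async_queue.
celery_app.send_task('proj.tasks.example', routing_key='async_task.example')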

Any ideas?

EDIT:

OK, now it appears to happen randomly. I noticed this in the RabbitMQ logs:

=WARNING REPORT==== 6-Jan-2014::17:31:54 ===
closing AMQP connection <0.295.0> (some_ip_address:36842 -> some_ip_address:5672):
connection_closed_abruptly
Richard Knop
  • Is the wait for a response on the third server long? – Omid Raha Jan 06 '14 at 18:06
  • @OmidRaha No, the response comes back immediately. But sometimes it's an error. – Richard Knop Jan 06 '14 at 22:10
  • Is there any proxy between them? – Omid Raha Jan 06 '14 at 22:14
  • @OmidRaha There is a load balancer. I am using AWS, and there is ELB (elastic load balancer) in front of the RabbitMQ. I have been trying different settings. I now get "Socket closed" about 10% of the time. 9 tasks go through successfully, then one task fails... and so on. – Richard Knop Jan 06 '14 at 22:20
  • I have three servers: one is the RESTful API server, which takes HTTP requests and sends tasks to RabbitMQ; tasks get executed remotely on a worker server and results go back to the API server. All three are EC2 instances, and there is a load balancer in front of RabbitMQ because I will want to create a cluster later. – Richard Knop Jan 06 '14 at 22:22

2 Answers


Is your RabbitMQ server or your Celery worker behind a load balancer by any chance? If so, the load balancer is closing idle TCP connections after some period of inactivity, and you will have to enable heartbeats from the client (worker) side. If you do, I would not recommend using the pure-Python amqp library for this; replace it with librabbitmq instead.
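
A minimal sketch of what that could look like with Celery 3.x settings (the numbers are illustrative, and on AWS the heartbeat interval should stay well below the ELB idle timeout, which defaults to 60 seconds):

# Send an AMQP heartbeat every 30 seconds so the load balancer keeps seeing
# traffic on otherwise idle broker connections.
BROKER_HEARTBEAT = 30

# Install the C client with `pip install librabbitmq`; Celery 3.x picks it up
# automatically for amqp:// broker URLs in place of the pure-Python amqp library.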

pbhowmick

The connection_closed_abruptly warning is logged when a client disconnects without going through the proper AMQP shutdown protocol:

channel.close(...)

Request a channel close.

This method indicates that the sender wants to close the channel. This may be due to internal conditions (e.g. a forced shut-down) or due to an error handling a specific method, i.e. an exception. When a close is due to an exception, the sender provides the class and method id of the method which caused the exception.

After sending this method, any received methods except Close and Close-OK MUST be discarded. The response to receiving a Close after sending Close must be to send Close-Ok.

channel.close-ok():

Confirm a channel close.

This method confirms a Channel.Close method and tells the recipient that it is safe to release resources for the channel.

A peer that detects a socket closure without having received a Channel.Close-Ok handshake method SHOULD log the error.

Here is an issue about that.

Can you set custom values for BROKER_HEARTBEAT and BROKER_HEARTBEAT_CHECKRATE and check again? For example:

BROKER_HEARTBEAT = 10 
BROKER_HEARTBEAT_CHECKRATE = 2.0
Omid Raha
  • It's still happening. Even with heartbeat. – Richard Knop Jan 06 '14 at 22:11
  • @RichardKnop I think one solution in these circumstances is to change your `Broker` from `RabbitMQ` to `Redis`. – Omid Raha Jan 07 '14 at 22:22
  • If you decide to switch to `Redis` as the `broker` for `celery`, this [topic](http://stackoverflow.com/q/9140716/538284) is useful. – Omid Raha Jan 08 '14 at 17:46
  • I have actually found a workaround: setting BROKER_POOL_LIMIT = 0 solves the problem. I am now opening a new connection for each task and it is working. Not ideal, but it works fine (see the sketch after these comments). I don't want to switch from RabbitMQ. – Richard Knop Jan 09 '14 at 09:46
  • This is not unusual; the network is probably unreliable. All applications using sockets must consider that connections can be broken at any point. You should look at the CELERY_TASK_PUBLISH_RETRY settings. Abruptly closed means that the client did not go through the extra handshake to close the connection, which shouldn't be a worry. Redis also uses sockets, so the situation is no different there... – asksol Jan 09 '14 at 15:29
  • I think I have the same issue. So far I have added the heartbeats and also increased `BROKER_CONNECTION_TIMEOUT`, but I will try `BROKER_POOL_LIMIT = 0` next. I have task chains, and every once in a while one task in the chain becomes pending forever. – dashesy Apr 11 '15 at 02:58
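
For reference, a sketch combining the workaround from the comments above with the publish-retry settings asksol mentions (the values are illustrative):

# Disable the broker connection pool so every publish opens a fresh connection
# instead of reusing one that an idle load balancer may have silently dropped.
BROKER_POOL_LIMIT = 0

# Keep heartbeats enabled so long-lived worker connections stay alive.
BROKER_HEARTBEAT = 10
BROKER_HEARTBEAT_CHECKRATE = 2.0

# Retry publishing a task if the connection turns out to be dead anyway.
CELERY_TASK_PUBLISH_RETRY = True
CELERY_TASK_PUBLISH_RETRY_POLICY = {'max_retries': 3}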