We have two RabbitMQ clusters, one used as the upstream and the other as the downstream. Each cluster has 3 nodes. Messages published to the orders exchange on the upstream cluster are delivered to the downstream cluster using RabbitMQ federation.
This had been working fine for the last 6 months. Suddenly, on 10/4, we received a cluster partition error on the upstream cluster. It was caused by one of the systems in the cluster hanging for more than a minute. We saw recurring entries about this in the system logs and temporarily brought that system down. The upstream cluster is now running as a two-node cluster. It was not noticed at the time, but on 10/8 we realized that the downstream side had not received any order messages since 10/4.
Upon further investigation, I found that the federation link on the downstream cluster was still showing as running, but 87000+ messages had accumulated on the auto-created "federation" queue on the upstream cluster. To retrieve those messages, I restarted the federation link from the downstream cluster. Unexpectedly, I saw the "federation" queue get deleted and recreated on the portal, taking those 87000+ messages with it into the darkness of space. We started receiving new messages from that point onwards, but the old messages were simply lost.
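One way to record the state of the upstream "federation" queue before touching the link would be to read it from the management HTTP API. The following is only a minimal sketch under assumptions: the host name `upstream-host`, the queue name suffix `EXAMPLE`, and the sample response body are hypothetical placeholders, and no request is actually sent here.

```python
import json
from urllib.parse import quote


def queue_api_url(base_url, vhost, queue_name):
    """Build the management API URL for one queue (names are percent-encoded)."""
    return "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"), quote(vhost, safe=""), quote(queue_name, safe="")
    )


def summarise_queue(info):
    """Pick out the fields worth recording before a federation link is restarted."""
    args = info.get("arguments", {})
    return {
        "messages": info.get("messages", 0),
        "durable": info.get("durable"),
        "x-expires": args.get("x-expires"),          # None: no queue expiry argument
        "x-message-ttl": args.get("x-message-ttl"),  # None: no message TTL argument
    }


# Hypothetical response body, shaped like GET /api/queues/{vhost}/{name}.
sample = json.loads(
    '{"name": "federation: order.exch -> EXAMPLE", "messages": 87000,'
    ' "durable": true, "arguments": {}}'
)

url = queue_api_url("http://upstream-host:15672", "production", sample["name"])
summary = summarise_queue(sample)
print(url)
print(summary)
```

Saving such a summary before and after a link restart would show whether the queue depth and arguments changed across the restart.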
Before putting the solution into place, we ran a POC in which we shut down each of the two clusters in turn. Each time, the federation queue retained the persistent messages, and once both clusters were back in the right state, the downstream federation link was able to fetch those messages. So we concluded that whenever the "federation" queue is available on the upstream node, the federation link on the downstream side should pick up the messages, and hence we never anticipated this issue.
We do not set the x-expires or x-message-ttl parameters in the federation configuration, and the app does not set them when publishing messages either. The federation configuration contains only "trust-user-id": false, the URI (all 3 cluster nodes) and the exchange name. Everything else is left at its default, which means "x-expires" on the federation queue should be set to "never" (so the queue should live forever unless the federation link is deleted on the downstream side). Our messages are also published as persistent.
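For illustration, the configuration described above could be expressed as a federation upstream definition roughly like the one built below. This is a sketch, not our actual configuration: the node host names are hypothetical placeholders, and the optional `expires` and `message-ttl` keys are simply omitted to mirror the fact that we set neither.

```python
import json


def build_upstream_definition(uris, exchange, trust_user_id=False,
                              expires_ms=None, message_ttl_ms=None):
    """Build the JSON body for a federation-upstream parameter.

    'expires' and 'message-ttl' are added only when given, so leaving
    them out keeps the broker defaults.
    """
    value = {
        "uri": list(uris),
        "exchange": exchange,
        "trust-user-id": trust_user_id,
    }
    if expires_ms is not None:
        value["expires"] = expires_ms
    if message_ttl_ms is not None:
        value["message-ttl"] = message_ttl_ms
    return {"value": value}


# Hypothetical URIs standing in for the three upstream cluster nodes.
definition = build_upstream_definition(
    ["amqp://upstream-node-1", "amqp://upstream-node-2", "amqp://upstream-node-3"],
    exchange="order.exch",
)
print(json.dumps(definition, indent=2, sort_keys=True))
```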
Only the logs on the upstream system have relevant information about this problem from the time of the federation link restart. The snippet is included below. It indicates that the queue was initialized with a depth of "0".
I want to ask the following questions:
- Is our understanding of federation correct (in the context of what is described above)? We do not have a way to reproduce the problem, but does anyone know the cause of it, or whether a setting is missing on our end?
- With each "federation link" restart on the downstream side, does the "federation" queue always get recreated on the upstream side?
- Is there a command to see the creation timestamp of objects such as queues and exchanges?
- What best practices or techniques can we follow to ensure that the federation queue is not deleted?
RabbitMQ versions:
- Upstream: RabbitMQ 3.6.1, Erlang R16B03-1
- Downstream: RabbitMQ 3.6.15, Erlang 20.3.4
Log snippet from the upstream RabbitMQ node. No other relevant log entries were found.
++++++++++++++++++++++++++++++++++++++++++++++++++++++
=WARNING REPORT==== 8-Oct-2018::14:57:38 ===
closing AMQP connection <0.1688.0> (:51364 -> :5672):
client unexpectedly closed TCP connection
=INFO REPORT==== 8-Oct-2018::14:58:07 ===
accepting AMQP connection <0.521.123> (:46659 -> :5672)
=INFO REPORT==== 8-Oct-2018::14:58:08 ===
Mirrored queue 'federation: order.exch -> ' in vhost 'production': Adding mirror on node 'rabbit@upstream-hostname': <7719.25968.3282>
=INFO REPORT==== 8-Oct-2018::14:58:08 ===
Mirrored queue 'federation: order.exch -> ' in vhost 'production': Synchronising: 0 messages to synchronise
=INFO REPORT==== 8-Oct-2018::14:58:08 ===
Mirrored queue 'federation: order.exch -> ' in vhost 'production': Synchronising: batch size: 4096
=INFO REPORT==== 8-Oct-2018::14:58:08 ===
Mirrored queue 'federation: order.exch -> ' in vhost 'production': Synchronising: all slaves already synced
=INFO REPORT==== 8-Oct-2018::14:58:09 ===
accepting AMQP connection <0.567.123> (:46659 -> :5672)
++++++++++++++++++++++++++++++++++++++++++++++++++++
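The "0 messages to synchronise" entry is what suggests the queue came back empty. As a rough sketch, assuming the log format shown in the snippet, such entries can be pulled out of a larger log file with a regular expression; the pattern and helper name below are my own and not part of RabbitMQ.

```python
import re

# Two entries copied from the snippet above.
LOG = """=INFO REPORT==== 8-Oct-2018::14:58:08 ===
Mirrored queue 'federation: order.exch -> ' in vhost 'production': Synchronising: 0 messages to synchronise
=INFO REPORT==== 8-Oct-2018::14:58:08 ===
Mirrored queue 'federation: order.exch -> ' in vhost 'production': Synchronising: batch size: 4096
"""

SYNC_RE = re.compile(
    r"Mirrored queue '(?P<queue>[^']*)' in vhost '(?P<vhost>[^']*)': "
    r"Synchronising: (?P<count>\d+) messages to synchronise"
)


def sync_counts(log_text):
    """Return (queue, vhost, message_count) for each mirror sync entry."""
    return [
        (m.group("queue"), m.group("vhost"), int(m.group("count")))
        for m in SYNC_RE.finditer(log_text)
    ]


counts = sync_counts(LOG)
print(counts)
```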
Please let me know if you need more information from my end to answer these questions.