We have been seeing the issue below with RabbitMQ and have been manually restarting the servers every weekend as a workaround.

Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. This is a dangerous situation. RabbitMQ clusters should not be installed on networks which can experience partitions.

We have gone through other popular posts on the topic e.g. here and here

Our network is not highly reliable and occasional blips are expected, but when it comes back up I would have expected the disconnected node of the 4-node RabbitMQ cluster to rejoin the rest of the cluster, as is the case with the 4 Tomcat nodes installed on the same servers.

  1. The nodes in each partition continue to run independently, but that does not seem like a graceful recovery from a failure in one node.
  2. We didn't have much luck with rabbitmqctl commands such as rabbitmqctl cluster_status. It would sporadically cause the RabbitMQ process to hang, which then needed a sudo kill of the RabbitMQ process.

We are at the point of evaluating a move to Kafka or another message broker that handles network partitions well.

Any thoughts on how to avoid the manual RabbitMQ restarts, or on Kafka's ability to handle this situation, would be highly appreciated.

Javaboy

1 Answer

I think Kafka with replication should be able to handle network partitions quite easily, as long as the number of partitioned brokers is less than the replication factor of your topic (that is, the consumers and producers can always reach at least one broker for the topics they're operating on).
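As a rough illustration of that condition, here is a minimal Python sketch. It is not Kafka code: the function names and broker ids are hypothetical, and it deliberately ignores in-sync replica state and leader election, which also affect availability in a real cluster. It only checks whether at least one replica of a partition is still reachable.

```python
def reachable_replicas(replica_brokers, unreachable_brokers):
    """Return the replicas of a partition that clients can still reach."""
    return set(replica_brokers) - set(unreachable_brokers)


def partition_available(replica_brokers, unreachable_brokers):
    """True if at least one replica of the partition is still reachable."""
    return len(reachable_replicas(replica_brokers, unreachable_brokers)) > 0


# Hypothetical setup: a partition with replication factor 3, on brokers 1, 2, 3.
replicas = [1, 2, 3]

# Two of the three replicas are cut off: one replica is still reachable.
print(partition_available(replicas, unreachable_brokers=[1, 2]))

# All three replicas are cut off: the partition is unavailable.
print(partition_available(replicas, unreachable_brokers=[1, 2, 3]))
```

With a replication factor of 3, losing up to 2 of the replica brokers still leaves one copy reachable in this simplified model.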

To avoid backpressure in the clients while ZooKeeper discovers the partition and propagates the information to the producers and consumers, you may want to set a short ZK heartbeat interval (yes, you'll need ZK, and a cluster of it too, since you absolutely don't want your whole ZK ensemble partitioned).

Fair warning though: spreading a topic over several partitions on a cluster of Kafka brokers drops the global FIFO aspect of your message queue, since Kafka only guarantees ordering within a single partition. That can be pretty disturbing if you're expecting consumers to read messages in the same order the producers sent them, which you could expect with RabbitMQ.
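To make the ordering caveat concrete, here is a toy Python simulation, not the Kafka client. It assumes messages are routed to partitions by hashing a key; `zlib.crc32` is a stand-in for whatever hash the real partitioner uses, and all function names are hypothetical. Messages with the same key land in the same partition log and keep their relative order, while nothing ties the order of different partitions together.

```python
import zlib
from collections import defaultdict


def pick_partition(key, num_partitions):
    # Stand-in hash for illustration; a real Kafka partitioner uses a different function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(messages, num_partitions):
    """Append (key, value) messages to per-partition logs, in send order."""
    logs = defaultdict(list)
    for key, value in messages:
        logs[pick_partition(key, num_partitions)].append((key, value))
    return dict(logs)


def values_for_key(logs, key):
    """Collect the values for one key in the order they sit in the partition logs."""
    return [v for log in logs.values() for k, v in log if k == key]


# Hypothetical stream: two keys interleaved by the producer.
sent = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 3)]
logs = produce(sent, num_partitions=3)

# Per-key order is preserved because each key always maps to one partition.
print(values_for_key(logs, "a"))
print(values_for_key(logs, "b"))
```

If strict ordering matters for a group of messages, giving them the same key (or using a single partition) keeps them in one log, at the cost of parallelism.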

C4stor
  • That's pretty insightful. Order of messages is not important to us, only recovery is. What if the ZooKeeper server goes down? Does ZooKeeper also have some sort of failover mechanism? – Javaboy Jul 01 '15 at 21:17
  • ZK is robust as long as you have at least one server answering, if I'm not mistaken (I'm not 100% sure whether it needs at least one server or a quorum), hence the need for a cluster. Partitioned servers will automatically rejoin when they reconnect. You can install them on the Kafka brokers if you want, but I would advise using a separate physical disk so you don't hurt Kafka's scalability, which relies on sequential I/O. – C4stor Jul 03 '15 at 07:05