RabbitMQ Unable to Join Cluster

Question

I am trying to learn how to cluster RabbitMQ nodes, following this tutorial as well as the official documentation.

I have 2 physical machines with RabbitMQ deployed on them through Docker. machine1 (192.168.1.2) is to be the first node of the cluster, and machine2 (192.168.1.3) is to join it.

When I run `rabbitmqctl join_cluster rabbit@192.168.1.2` from machine2, it fails with the following message:

Clustering node rabbit@node2.rabbit with rabbit@192.168.1.2
Error: unable to perform an operation on node 'rabbit@192.168.1.2'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@192.168.1.2
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: ['rabbit@192.168.1.2']

rabbit@192.168.1.3:
  * connected to epmd (port 4369) on 192.168.1.2
  * epmd reports node 'rabbit' uses port 25672 for inter-node and CLI tool traffic
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: check if the Erlang cookie identical for all server nodes and CLI tools
  * suggestion: check if all server nodes and CLI tools use consistent hostnames when addressing each other
  * suggestion: check if inter-node connections may be configured to use TLS. If so, all nodes and CLI tools must do that
   * suggestion: see the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more


Current node details:
 * node name: 'rabbitmqcli-1352-rabbit@node2.rabbit'
 * effective user's home directory: /var/lib/rabbitmq
 * Erlang cookie hash: XXXXXXXXXXXXX

The error logs on machine1 show nothing related to such a connection attempt. I have verified the md5sum of the cookies on both docker containers and they are exactly the same. So are the permissions.

I suspected that port 4369 might not be reachable, but it is.

I am unsure what I am doing wrong. Can someone help here?

Additional information:

I am using the rabbitmq:3.8.5-management image. It uses Erlang/OTP 23 [erts-11.0.3].

I have been checking the troubleshooting guide, but I am unsure what seems wrong here. Please let me know if I can provide more information.

Something seems strange in the architecture, in the sense that the two docker engines are running standalone...but nevertheless it should work anyway. Have you tried to access http://node-1.rabbit:15672 **and** http://node-2.rabbit:15672 from your VM2? — Neo Anderson, Aug 02 '20 at 06:15
I am unable to reach them through those names. The tutorial claims that I should be able to, but it does not work. Should nodes exposed via hostnames be able to reach each other through that special hostname from outside the Docker host? — stonecharioteer, Aug 02 '20 at 06:45
When you say that `:4369` is reachable, I suppose you checked the connectivity from VM2 to VM1, but have you also checked from (VM2-rabbit-container) to reach the same port on VM1? I'd bet here is the glitch. — Neo Anderson, Aug 02 '20 at 07:02
I tried netcat both on machine2 and inside the container of machine2. — stonecharioteer, Aug 02 '20 at 07:06
It seems that the only missing piece is the names, which are not being resolved. Try manually adding DNS entries to the proper `/etc/hosts` files on each VM. Then redo the `netcat`/`nslookup`, but this time with the FQDNs indicated in the tutorial, from both the VM and the container. — Neo Anderson, Aug 02 '20 at 07:11
For Erlang clustering to succeed, you need the following: the same cookie everywhere, correct node names, and the same 'longnames/shortnames' option everywhere. Keep in mind that nodes `rabbit@node2.rabbit` and `rabbit@192.168.1.3` are different at Erlang's level, so make sure that both nodes start named either `rabbit@192.168.1.X` or `rabbit@nodeX.rabbit`, and that `nodeX.rabbit` resolves. — José M, Aug 02 '20 at 09:04
@JoséM you are right. When clustering from within 2 containers on *separate* machines, I need to ensure that the service is accessible across the network via the node name that Erlang uses. So being accessible by a node's hostname is not sufficient. An easy fix is editing the hosts file on the containers themselves. Thank you. — stonecharioteer, Aug 05 '20 at 05:57
@NeoAnderson could not tag you too in the response. Thank you. — stonecharioteer, Aug 05 '20 at 05:57

score 2 · Accepted Answer · answered Aug 05 '20 at 06:00

So thanks to @NeoAnderson and @José M, I was able to understand what happened.

The containers running RMQ need to be reachable across the network via the hostname that Erlang uses in the node name. Since the containers' hostnames could not be resolved from a container on the other machine, clustering failed.
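One way to check this is to take the node name that Erlang reports and test whether its host part resolves from the other machine's container. The following is a minimal Python sketch, not part of my original setup: `node_host` and `host_resolves` are hypothetical helper names, and it assumes Python is available wherever you run it (the RabbitMQ image may not ship it).

```python
import socket


def node_host(node_name: str) -> str:
    """Return the host part of an Erlang node name such as 'rabbit@node2.rabbit'."""
    name, sep, host = node_name.partition("@")
    if not (name and sep and host):
        raise ValueError(f"not a valid node name: {node_name!r}")
    return host


def host_resolves(host: str) -> bool:
    """Return True if the local resolver (including /etc/hosts) can resolve host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

For example, if `host_resolves(node_host("rabbit@node2.rabbit"))` returns False inside the container on machine1, that points to the name resolution problem described here. Note that an IP literal always "resolves", so this only tests names, not reachability.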

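A simple fix is to edit the /etc/hosts file in each container so that the hostname used in the other node's Erlang node name maps to that machine's IP address. On the joining container, that means an entry pointing at the "leader" node.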
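As a sketch of what such /etc/hosts entries could look like, the snippet below renders hosts-file lines from an IP-to-hostname mapping. The IPs and `node2.rabbit` come from the question; `node1.rabbit` is only an assumed name for the first node, and `hosts_entries` is a hypothetical helper.

```python
def hosts_entries(mapping: dict[str, str]) -> list[str]:
    """Render one /etc/hosts line per IP address -> hostname pair."""
    return [f"{ip}\t{host}" for ip, host in mapping.items()]


# Assumed mapping: node1.rabbit is a guess at machine1's node hostname.
nodes = {
    "192.168.1.2": "node1.rabbit",
    "192.168.1.3": "node2.rabbit",
}

# Print the lines to append to /etc/hosts in each container.
print("\n".join(hosts_entries(nodes)))
```

With entries like these in place, the join would be issued against the name rather than the IP, for example `rabbitmqctl join_cluster rabbit@node1.rabbit`, assuming machine1's node is really named that. Docker can also add such entries at container start with the `--add-host` option of `docker run`.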

I used Docker only to avoid installing RMQ directly, not because I thought it was the best way to do this. Alternatively, Docker Swarm or k8s would have provided the right networking for me.

But the root cause was definitely the nodename problem.
