
I'm trying to move from SQS to RabbitMQ as our messaging service. I'm looking to build a stable, highly available queuing service. For now I'm going with a cluster.

Current implementation: I have three EC2 machines with RabbitMQ and the management plugin installed, baked into an AMI. I then explicitly go to each machine and run:

sudo rabbitmqctl join_cluster rabbit@<hostnameOfParentMachine>

The HA policy is set to all, and synchronization works. There's a load balancer on top of it with a DNS name assigned. So far this setup works.
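
For reference, the "HA set to all" policy described above is typically applied with something like the following (the policy name and pattern are illustrative, not the exact command used here):

```shell
# Mirror every queue to all nodes and sync mirrors automatically.
# "ha-all" and ".*" are example values, not the OP's exact policy.
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
```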

Expected implementation: create an autoscaling clustered environment where machines that come up or go down join or leave the cluster dynamically. What is the best way to achieve this? Please help.

Karthik
  • autoscaling based on? cloudwatch? – Gabriele Santomaggio Jul 10 '15 at 13:51
  • Yes. But then the scaled instance has to join the cluster automatically. – Karthik Jul 11 '15 at 04:06
  • 1
    Be careful - Erlang in clustered mode is not tolerant to network partitions (including micro-partitions), and may cause some problems; I had regular micro-partitions on AWS which would bring my cluster down. I would recommend running a staging cluster for a while before committing to it for production. – Chris Heald Jul 12 '15 at 20:21

2 Answers


I had a similar configuration 2 years ago.

I decided to use an Amazon VPC. By default my design had two RabbitMQ instances always running, configured in a cluster (called master nodes). The RabbitMQ cluster was behind an internal Amazon load balancer.

I created an AMI with RabbitMQ and management plug-in configured (called “master-AMI”), and then I configured the autoscaling rules.

If an autoscaling alarm is raised, a new instance is launched from the master-AMI. On first boot it executes the following script:

#!/usr/bin/env python
import base64
import json
import urllib.request
from subprocess import call

if __name__ == '__main__':
    prefix = ''
    # Stop the app and reset the node so it can join an existing cluster.
    call(["rabbitmqctl", "stop_app"])
    call(["rabbitmqctl", "reset"])
    try:
        _url = 'http://internal-myloadbalamcer-xxx.com:15672/api/nodes'
        print(prefix + 'Get json info from ' + _url)
        request = urllib.request.Request(_url)

        credentials = base64.b64encode(b'guest:guest').decode('ascii')
        request.add_header("Authorization", "Basic %s" % credentials)
        data = json.load(urllib.request.urlopen(request))
        # If the script gets an error here you can assume that it's the
        # first machine, and exit without handling the error. Remember to
        # add the new machine to the balancer.
        print(prefix + 'request ok... looking for a running node')

        for r in data:
            if r.get('running'):
                print(prefix + 'found running node to bind to')
                print(prefix + 'node name: ' + r.get('name') + ' - running: ' + str(r.get('running')))
                call(["rabbitmqctl", "join_cluster", r.get('name')])
                break
    except Exception:
        print(prefix + 'error while adding node')
    finally:
        # Always restart the app, whether or not the join succeeded.
        call(["rabbitmqctl", "start_app"])

The script uses the HTTP API “http://internal-myloadbalamcer-xxx.com:15672/api/nodes” to find nodes, then chooses a running one and joins the new instance to the cluster.

As HA policy I decided to use this:

rabbitmqctl set_policy ha-two "^two\." \
   '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'

Well, the join is “quite” easy; the problem is deciding when you can remove a node from the cluster.

You can’t remove a node based only on an autoscaling rule, because the queues may still hold messages that have to be consumed.
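
For reference, once a node is safely drained, the cluster-side removal itself is a couple of `rabbitmqctl` steps (the hostname below is a placeholder):

```shell
# On the instance being retired:
rabbitmqctl stop_app

# On any remaining cluster node:
rabbitmqctl forget_cluster_node rabbit@retired-hostname
```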

I decided to run a script periodically on the two master-node instances that:

  • checks the message counts through the API http://node:15672/api/queues
  • if the message count for all queues is zero, removes the instance from the load balancer and then from the RabbitMQ cluster.
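
The periodic drain check above can be sketched as follows. The URL, credentials, and the `messages` field follow the RabbitMQ management API; the wiring into the balancer and cluster removal is only hinted at in comments, since that part is deployment-specific (this is my sketch, not the author's exact script):

```python
import base64
import json
import urllib.request


def fetch_queues(api_url, user="guest", password="guest"):
    """Fetch the queue list from the management API (/api/queues)."""
    request = urllib.request.Request(api_url)
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + credentials)
    return json.load(urllib.request.urlopen(request))


def all_queues_empty(queues):
    """True when every queue reports zero messages (ready + unacked)."""
    return all(q.get("messages", 0) == 0 for q in queues)


def main():
    # "node" is a placeholder hostname for the instance being checked.
    queues = fetch_queues("http://node:15672/api/queues")
    if all_queues_empty(queues):
        # Safe to deregister the instance from the load balancer,
        # then remove it from the cluster.
        print("node is drained; safe to remove")
```

`main()` would be invoked from cron (or similar) on each master node.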

This is broadly what I did, hope it helps.

[EDIT]

I edited the answer, since there is this plugin that can help:

I suggest to see this: https://github.com/rabbitmq/rabbitmq-autocluster

The plugin has been moved to the official RabbitMQ repository, and can easily solve this kind of problem.
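
For reference, the plugin's AWS backend is configured through environment variables roughly like the following. The variable names are my reading of the plugin's README; verify them against the current docs before relying on them:

```shell
# Assumed variable names from the rabbitmq-autocluster README:
export AUTOCLUSTER_TYPE=aws         # use the AWS discovery backend
export AWS_AUTOSCALING=true         # discover peers via the Auto Scaling Group
export AWS_DEFAULT_REGION=eu-west-1 # region the cluster runs in
```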

Gabriele Santomaggio
  • Your script works like a charm! Thanks a ton. Regarding taking the node away from the load balancer, I'm using an autoscaling policy based on memory utilization, i.e. <40%. – Karthik Jul 28 '15 at 06:40
  • I was able to do that via a custom metric for memory, as per the AWS documentation on custom metrics. Thanks for the code. – Karthik Sep 10 '15 at 13:58
  • Isn't the point of the clustering such that you _can_ remove a node because the data are replicated between the nodes? – Volte Oct 13 '16 at 13:52
  • @K-Iyer can you please share how you handled removing nodes from the cluster automatically? – Berlin Mar 19 '17 at 23:15
  • @Berlin Yes, that is something I put on hold while racing against time, and I haven't found time to look at it yet. Currently those stay as inactive nodes in the RabbitMQ admin page. :P Sorry about that – Karthik Apr 13 '17 at 06:27
  • @Karthik you can clean dead nodes with the autocluster plugin; it works very well – Berlin Apr 13 '17 at 12:18
  • Hi all, this post is a bit old; we are working on https://github.com/rabbitmq/rabbitmq-autocluster, so I suggest using it to scale on AWS – Gabriele Santomaggio Apr 13 '17 at 12:50
  • Nice to see it as a plugin @Gabriele. Will definitely be useful. – Karthik Apr 14 '17 at 05:30

We recently had a similar problem.

We tried to use https://github.com/rabbitmq/rabbitmq-autocluster but found it overcomplicated for our use case.

I created a Terraform configuration to spin up X RabbitMQ nodes across Y subnets (availability zones) using an Auto Scaling Group.

TL;DR https://github.com/ulamlabs/rabbitmq-aws-cluster

The configuration creates an IAM role that allows nodes to autodiscover all other nodes in the Auto Scaling Group.
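
A minimal sketch of the kind of IAM policy such autodiscovery needs: the nodes must be able to describe the Auto Scaling Group and its instances. The action names below are my assumption of what discovery requires, not copied from the linked repository:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingInstances",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}
```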