1

I'm running the RabbitMQ Docker image (rabbitmq:3-management) in AWS ECS. It works fine with no issues.

Then I added a bit more complexity and created a service with the same RabbitMQ image, now connected to an AWS Network Load Balancer (my ultimate goal is a RabbitMQ cluster, so I need a few instances behind a load balancer). The target group is configured with port 5672 and uses the same port for health checks. The interval between health checks is 30 seconds (the maximum available) and the threshold is 5. In the ECS service configuration, the health check grace period is 120 seconds, which should be enough for the service to start. What happens is that a few minutes after I start the service, it gets killed and restarted:

service Rabbit-master (instance i-xxx) (port 5672) is unhealthy in target-group Rabbit-cluster-target-group due to (reason Health checks failed)

'A few minutes' means 2, 5, or 9 minutes; it varies. It doesn't happen at startup but after a while. I can also see that RabbitMQ itself works fine (in the logs and in the management panel). So it is the ELB that causes the restart; it is not that RabbitMQ died first and the ELB then restarted it.
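One way to confirm that the broker keeps accepting connections while the target is reported unhealthy is a plain TCP connect against port 5672, which is roughly what a TCP health check does. This is only a minimal sketch using Node's built-in `net` module; the function name `tcpProbe` and the 3-second default timeout are my own choices, not anything the load balancer uses.

```javascript
const net = require('net');

// Resolves true if a TCP connection to host:port is established within
// timeoutMs, and false on refusal, any socket error, or timeout.
// (tcpProbe and the default timeout are illustrative assumptions.)
function tcpProbe(host, port, timeoutMs = 3000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    let settled = false;

    // Settle exactly once and always release the socket.
    const finish = (ok) => {
      if (settled) return;
      settled = true;
      socket.destroy();
      resolve(ok);
    };

    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}
```

Running something like `tcpProbe('<instance private IP>', 5672).then(console.log)` from another host in the same VPC shows whether the port is reachable from outside the instance, not just from localhost.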

So my question is: what am I doing wrong, and how can I get RabbitMQ to run stably in ECS together with an ELB? Is the idea of using port 5672 for health checks wrong? If so, which port should I use instead? 15672?

Sorry if I haven't provided enough details. I described those that seemed relevant to me. If you need anything more, I will be happy to elaborate. Thanks!

ded.diman

3 Answers

3

Apparently the problem was that the security group of the RabbitMQ service had to be configured to allow the IP addresses of the NLB. This idea didn't occur to me immediately because

  1. the restarts did not happen right after the service started but a few minutes later
  2. NLBs don't have security groups, and their IPs are not that obvious to find.

More details are here:

https://forums.aws.amazon.com/thread.jspa?threadID=263245

and here:

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-register-targets.html#target-security-groups
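As a rough sketch of what such a rule looks like, assuming the NLB's private IPs are already known: the helper below builds an ingress permission list in the `IpPermissions` shape that the EC2 `AuthorizeSecurityGroupIngress` API accepts, with one /32 CIDR per NLB address. The function name `buildNlbIngress` and the two IP addresses are made up for illustration.

```javascript
// Builds an IpPermissions-style list allowing TCP traffic on `port`
// from each NLB private IP (one /32 CIDR per address).
// buildNlbIngress is a hypothetical helper, not part of any AWS SDK.
function buildNlbIngress(nlbPrivateIps, port) {
  return [{
    IpProtocol: 'tcp',
    FromPort: port,
    ToPort: port,
    IpRanges: nlbPrivateIps.map((ip) => ({
      CidrIp: `${ip}/32`,
      Description: 'NLB health checks',
    })),
  }];
}

// Hypothetical addresses; substitute the private IPs of your own NLB.
const permissions = buildNlbIngress(['10.0.1.15', '10.0.2.27'], 5672);
console.log(JSON.stringify(permissions, null, 2));
```

The resulting object would then be passed, together with the service's security group ID, to whatever tool you use to manage the group. An NLB has one private IP per enabled Availability Zone, so the list needs an entry for each of them.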

ded.diman
  • Thanks for this! I had this exact problem and was reading about this all over the place. And only your comment was concise enough for me to figure out what needs to be done. The problem I'm having right now is I'm opening services to the whole VPC, but I'm not thrilled with that. I can get private NLB IPs from the AWS console, but don't know how to use them in the CDK environment. If you know how to close it down, it would help immensely. Thanks! – Andrej Mohar Nov 03 '22 at 17:59
  • BTW, you can use this to get NLB IPs: `aws ec2 describe-network-interfaces --query 'NetworkInterfaces[?contains(Description, NLB-name)].PrivateIpAddresses[*].PrivateIpAddress[]'`. You need to change `NLB-name` with your actual network load balancer name (surrounded by backticks if it is a string), and it should work. For me, it prints two IPs. Haven't tried, but this should be enough to add to the service security group ingress - feel free to correct me if I'm wrong here. – Andrej Mohar Nov 03 '22 at 18:00
  • Thanks! After adding ports to my EKS security group, LB was able to talk to instances in that group! – Anton Kim Dec 13 '22 at 17:14
1

It is very important to specify the correct health check path or port when attaching your service to an ALB.

The ALB does not check the response body, only the status code. The only call that returns a 200 status code is curl -I http://127.0.0.1:15672; the other endpoints require authentication or return 404 or 403, which makes the LB mark the target as unhealthy.


Port 15672, by contrast, returns 200.


Also, verify that the health check of the target group used by the ECS task points to the correct port of the instance.

2nd option: you can write a custom health check for the LB that monitors both ports of your container, since an ALB health check targets only one port at a time. A simple example can be based on Node.js: you run a small Node application that checks both ports and responds to the ALB health checks.

In this case, your health check path will be /ping and the port will be 3007.

Below is the code that we use for ECS tasks where we need to check multiple ports.

const express = require('express');
const isAllReachable = require('is-all-reachable');
const request = require('request');

const app = express();

app.get('/ping', (req, res) => {
    // First check that every listed endpoint is reachable.
    isAllReachable([
        'http://localhost:15672'
        // 'http://localhost:otherport'
    ], (err, reachable, host) => {
        if (!reachable) {
            console.log('health check failed. This server is not reachable', err, host);
            res.status(500).send('health check failed. one of the ports is not reachable.');
            return;
        }

        // Reachable: now confirm that the management API actually responds.
        console.log('Ports reachable, checking RabbitMQ');
        request.get('http://localhost:15672/api/vhosts', {
            auth: {
                user: 'guest',
                pass: 'guest',
                sendImmediately: false
            }
        }, (error, response, body) => {
            // On a transport error `response` is undefined, so test `error` first.
            if (error || response.statusCode !== 200) {
                console.log('failed to get vhosts', error || response.statusCode);
                res.status(500).send('health check failed');
                return;
            }
            console.log({ status_code: response.statusCode, body: body });
            res.status(200).send('rabbit mq is running');
        });
    });
});

app.listen(3007, () => console.log('LB custom Health check server listening on port 3007!'));

For more in-depth monitoring, you can explore RabbitMQ monitoring.

Adiii
  • I use Network Load Balancer which works with TCP, not HTTP. Hence you don't have to (and can't) provide URL for health check. By default it pings given port. Yes I can add more sophisticated healthcheck to container. Actually I even had one at the beginning of my explorations. It was based on RabbitMQ embedded diagnostics tools. But then I ended up with default port ping since while creating cluster RabbitMQ has to be stopped at some moments. So such health check will fail at this moment, which is not desired behavior. – ded.diman Jun 24 '19 at 20:20
  • so you can set it a health check port 15672, go to your target group and override your health check port, as this port will respond with 200 status code – Adiii Jun 25 '19 at 04:57
  • The problem was in the security groups. It was not very obvious with an NLB, but still. Please see my own answer to the question above. :-) – ded.diman Jun 26 '19 at 10:59
  • Got it, an NLB doesn't have a security group; it uses the security group of the instance – Adiii Jun 26 '19 at 11:04
0

Does your health check URL work? This happened to me with an ALB. My case was:

  • ex: the target group was mapped to /api/profiles => container:4000, but my container didn't have any route serving api/profiles. Unlike Nginx, for example, the ALB doesn't rewrite the path, so it was looking for the api/profiles route in the container while my route was just /profiles. I changed the path in nginx and then it worked.

How to diagnose

Ntwobike
  • Thanks for feedback. But I use Network Load Balancer, not ALB. And NLB works with TCP, not HTTP. So it pings given port for health check. It's simply impossible to specify any URL there. – ded.diman Jun 24 '19 at 20:27