
I'm trying to get ECS Service Discovery working with Prometheus.

Currently my ECS container gets added to Route 53 like so:

+-----------------------------------------------+------+--------------------------------------------------------+
|                     Name                      | Type |                         Value                          |
+-----------------------------------------------+------+--------------------------------------------------------+
| my-service.local.                             | SRV  | 1 1 8080 123456-7890-1234-5678-12345.my-service.local. |
| 123456-7890-1234-5678-12345.my-service.local. | A    | 10.0.11.111                                            |
+-----------------------------------------------+------+--------------------------------------------------------+

I assume that if I added more running containers to ECS, I would get more A records in Route 53 with the name 123456-7890-1234-5678-12345.my-service.local.

In my Prometheus configuration file, I have supplied the following under scrape_configs:

    - job_name: 'cadvisor'
      scrape_interval: 5s
      dns_sd_configs:
      - names:
        - 'my-service.local'
        type: 'SRV'

However, when I check the target status in Prometheus, I see the following:

Endpoint: http://123456-7890-1234-5678-12345.my-service.local:8080/metrics
State: Down
Error: context deadline exceeded

I'm not familiar with how DNS service discovery works with SRV records, so I'm not sure where the problem lies exactly. Looking at how AWS ECS Service Discovery added the records, it looks like my-service.local maps to 123456-7890-1234-5678-12345.my-service.local:8080.

However, it looks like Prometheus doesn't then resolve 123456-7890-1234-5678-12345.my-service.local to the underlying private IPs and instead tries to scrape it directly.
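For reference, here is the two-step lookup I would expect, reproduced outside of Prometheus (a minimal sketch using dnspython; it has to run inside the VPC so the private .local zone is resolvable):

    # Reproduce the SRV -> A lookup chain for the service.
    # Requires dnspython (pip install dnspython) and a resolver that
    # can see the Route 53 private hosted zone (i.e. run in the VPC).
    import dns.resolver

    # Step 1: the SRV record yields a target hostname and port per task.
    for srv in dns.resolver.resolve("my-service.local", "SRV"):
        target = str(srv.target).rstrip(".")
        # Step 2: each target hostname resolves to a task's private IP.
        for a in dns.resolver.resolve(target, "A"):
            print(f"{a.address}:{srv.port}")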

Is there some configuration option that I'm missing to make this work or have I misunderstood something at a fundamental level?

2 Answers

Turns out the issue was that I needed to add a security group rule to allow my Prometheus instance to talk to my ECS cluster since both were in a public subnet.
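For reference, the missing rule boils down to something like this (a boto3 sketch; both security group IDs are placeholders for my Prometheus instance and the ECS tasks):

    # Allow the Prometheus instance's security group to reach the ECS
    # tasks on the container port. Both group IDs are placeholders.
    import boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId="sg-00000000000000000",  # SG attached to the ECS tasks
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 8080,            # the scraped container port
            "ToPort": 8080,
            "UserIdGroupPairs": [
                {"GroupId": "sg-11111111111111111"}  # Prometheus instance SG
            ],
        }],
    )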

Also, scaling up the desired count on the ECS service creates both another SRV record and an associated A record in Route 53 (not just one additional A record, as I previously thought).

Everything seems to work now.

  • Be aware of this caveat though when using ECS Service Discovery: Route 53 only returns a maximum of 8 records. So if you're running an ECS service of 20 tasks, you will still only get the metrics from 8 random tasks. – siwyd Nov 19 '19 at 17:13
  • @siwyd I just ran into this issue last week; what alternatives can you suggest? – Murukesh Apr 06 '20 at 05:00
  • @Murukesh At the time I wrote a small script that scrapes the Route 53 API for the records created by ECS Service Discovery: https://paste.sr.ht/%7Esiwyd/9974e66dd314f2d0d96eaafb23da9644e874ad54 (a minimal sketch of the same approach follows below). Maybe that can be of some use to you; you'll have to pick out what you can use. Another way might be to just ditch ECS Service Discovery and scrape the ECS API instead, which might even be simpler; I don't remember exactly why I decided to scrape Route 53 back then. – siwyd Apr 06 '20 at 13:44
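A minimal sketch of the Route 53 approach from the comment above (boto3; the hosted zone ID is a placeholder, and the record name matches the question):

    # Page through the Route 53 records that ECS Service Discovery
    # creates, instead of querying DNS (which caps answers at 8 records).
    # The hosted zone ID is a placeholder.
    import boto3

    r53 = boto3.client("route53")
    paginator = r53.get_paginator("list_resource_record_sets")

    targets = []
    for page in paginator.paginate(HostedZoneId="Z0000000000000"):
        for rrset in page["ResourceRecordSets"]:
            if rrset["Type"] == "SRV" and rrset["Name"] == "my-service.local.":
                for rr in rrset.get("ResourceRecords", []):
                    # SRV record value format: "priority weight port target"
                    _, _, port, target = rr["Value"].split()
                    targets.append(f"{target.rstrip('.')}:{port}")

    print(targets)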

A fairly good alternative to a "proper" service discovery mechanism like Consul, or to ECS Service Discovery with Route 53, is relying on the AWS API directly. This is appropriate as long as the total number of containers/tasks stays below a few thousand, since you are limited by the AWS API request cap.

A number of tools provide this functionality in combination with Prometheus file-based service discovery, for example https://pypi.org/project/prometheus-ecs-discoverer/ and https://github.com/teralytics/prometheus-ecs-discovery.
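If you would rather roll your own, the core of the ECS API approach fits in a few lines (a boto3 sketch; the cluster and service names, the container port, and the output path are all assumptions):

    # Query the ECS API for a service's running tasks and emit a
    # Prometheus file_sd target file. Cluster/service names, the
    # container port, and the output path are placeholders.
    import json
    import boto3

    ecs = boto3.client("ecs")

    task_arns = ecs.list_tasks(
        cluster="my-cluster",
        serviceName="my-service",
        desiredStatus="RUNNING",
    )["taskArns"]

    targets = []
    if task_arns:
        tasks = ecs.describe_tasks(cluster="my-cluster", tasks=task_arns)["tasks"]
        for task in tasks:
            # With the awsvpc network mode, the task's private IP is in
            # the ENI attachment details.
            for attachment in task.get("attachments", []):
                for detail in attachment.get("details", []):
                    if detail["name"] == "privateIPv4Address":
                        targets.append(f"{detail['value']}:8080")

    # file_sd format: a list of {"targets": [...], "labels": {...}} groups.
    with open("ecs_targets.json", "w") as f:
        json.dump([{"targets": targets, "labels": {"job": "cadvisor"}}], f)

Point file_sd_configs at the generated file and re-run the script on a schedule; Prometheus picks up changes to the file automatically.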

trallnag