
We use ICMP ping to determine host up/down status from Icinga2 in our AWS EC2 environment. This works well, but we have had a handful of issues where a host fails ping while its services are still OK.

My colleague is under the impression that Amazon occasionally throttles icmp traffic and that this is the cause of our false alerts.

So, two questions:

  1. Is this true? Does Amazon sometimes throttle ICMP?

  2. Is there a better alternative to ICMP ping for our monitoring system to use to determine host up/down status?

Of course we also monitor services, but a host up/down check is useful in cases where the host, rather than a service, has gone down.

Michael Martinez

2 Answers


I don't know any specifics about Amazon's implementation, so my answer will only cover generic behavior.

There are RFCs which recommend throttling of ICMP packets generated by a node, but those recommendations do not apply to the forwarding of ICMP packets by routers. I don't know of any good reason for a router to throttle some forwarded packets differently from others.

Sending packets for different port numbers over different links for load-balancing purposes is, however, a feature lots of hardware has, and if such a feature is in use it is entirely possible that your ICMP packets get routed over a different physical network path than packets to your service port. That is one possible explanation for the difference you are observing.

Notice that in a properly configured setup you should expect ICMP echo requests to be more reliable than your service, not less reliable. And that in itself is reason enough to avoid using ICMP echo requests for health checks.

The reason ICMP echo requests are more reliable is that they have fewer dependencies. An ICMP echo request is answered by the kernel networking stack, so a machine can be in very bad shape and still be capable of responding to ICMP echo requests.
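One way to check something closer to the service than ICMP echo is a plain TCP connect against a port you expect to be open (the host and port below are placeholders, not anything from the question):

```python
import socket

def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example with placeholder values: probe the local SSH port
print("UP" if tcp_check("127.0.0.1", 22) else "DOWN")
```

Unlike an ICMP echo, this only succeeds if something in userspace is actually listening on that port, so it exercises one more dependency than the kernel stack alone.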

kasperd

There are generally two high level types of checks used to monitor a service that is running on a traditional instance or virtual machine: host level checks and service level checks.

Host level checks are typically performed with an agent and/or your cloud provider's monitoring stack and cover metrics such as CPU utilization, CPU load, freeable memory, free disk space, etc.

Service level checks monitor the service itself, most often through a pre-defined healthcheck endpoint such as /healthcheck. You would configure a service check to perform an HTTP GET against that endpoint and, if a 200 response isn't returned, emit an alert for the bad state.
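A minimal sketch of such a check, mapping the HTTP result to the standard Nagios/Icinga plugin exit codes (the URL is a placeholder; anything other than a 200 is treated as CRITICAL):

```python
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2  # standard Nagios/Icinga plugin exit codes

def check_http(url, timeout=5):
    """GET a healthcheck URL and return (exit_code, message)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status == 200:
                return OK, "OK - %s returned 200" % url
            return CRITICAL, "CRITICAL - %s returned %d" % (url, resp.status)
    except (urllib.error.URLError, OSError) as exc:
        return CRITICAL, "CRITICAL - %s: %s" % (url, exc)

# Placeholder URL; a real plugin would sys.exit() with the returned code
code, message = check_http("http://localhost/healthcheck")
print(message)
```

A real Icinga2 plugin would exit with the returned code so the monitoring server can distinguish OK, WARNING, and CRITICAL states.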

Here are some other basic examples to consider for setting up a healthcheck:

  1. Check the service documentation (or build one into your service) for a pre-existing healthcheck endpoint
  2. If the service is a web service or has any HTTP endpoints, consider using those as a target for a healthcheck.
  3. If the service outputs logs to disk or syslog, you can monitor the logs for keywords that indicate a fault, or monitor for a log that has not been updated within a certain interval.
  4. If the service has a load balancer in front of it, for example an Amazon ELB or Google NLB, you can monitor responses from the server via the metrics they provide you.
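Point 3 above can be as simple as checking the modification time of the log file; the path and staleness threshold below are placeholders:

```python
import os
import time

def log_is_stale(path, max_age_seconds=300):
    """True if the log file hasn't been written within max_age_seconds."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return True  # a missing log file also counts as a fault
    return age > max_age_seconds

# Hypothetical path: alert if nothing was logged in the last 5 minutes
if log_is_stale("/var/log/myservice.log"):
    print("CRITICAL - log is stale or missing")
```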

In large distributed environments, it is common to collect stats in time series databases such as Graphite or InfluxDB. Your monitoring server then regularly checks specific metrics over a set period for anomalies.
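The fetch itself is store-specific, but the evaluation step of that pattern can be sketched as a threshold check over the last few samples, requiring several breaches so a single spike doesn't page anyone (the threshold, sample window, and values below are illustrative):

```python
def cpu_alert(samples, threshold=90.0, min_breaches=3):
    """Alert only if at least min_breaches recent samples exceed the
    threshold, to avoid flapping on a single transient spike."""
    breaches = sum(1 for value in samples if value > threshold)
    return breaches >= min_breaches

# Last five 1-minute CPU utilization samples (illustrative values)
recent = [97.0, 95.5, 40.2, 99.1, 96.3]
print("ALERT" if cpu_alert(recent) else "OK")  # prints ALERT (4 breaches)
```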

Using ICMP is not an ideal check, as it is the most basic form of host level check. It won't report the status of the service itself and should be one of your last options.

Update: I saw that this answer was marked as not answering the original question, which surprised me a bit. I'll be more direct: don't rely on ICMP for host level checks, for the reasons I mentioned above.

funkytown