There are generally two high-level types of checks used to monitor a service running on a traditional instance or virtual machine: host-level checks and service-level checks.
Host-level checks are typically performed with an agent and/or your cloud provider's monitoring stack, and they monitor metrics such as CPU utilization, CPU load, freeable memory, free disk space, etc.
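As a rough illustration of a host-level check, here is a minimal sketch of a free-disk-space check using only Python's standard library. The function names and the 10% threshold are hypothetical choices for this example, not taken from any particular agent; real agents collect many such metrics and ship them to a central system.

```python
import shutil


def disk_free_ratio(path: str = "/") -> float:
    """Return the fraction of the filesystem containing `path` that is free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / usage.total


def check_disk_space(path: str = "/", min_free_ratio: float = 0.10) -> bool:
    """Host-level check: True if at least `min_free_ratio` of the disk is free."""
    return disk_free_ratio(path) >= min_free_ratio


if __name__ == "__main__":
    # Hypothetical threshold: alert when less than 10% of the root filesystem is free.
    ok = check_disk_space("/", min_free_ratio=0.10)
    print("disk OK" if ok else "ALERT: disk space low")
```

Note that this tells you the host is healthy, not that the service on it is; that gap is what service-level checks cover.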
Service-level checks monitor the service itself, most often through a predefined healthcheck endpoint such as /healthcheck. You would configure a service check to perform an HTTP GET against that endpoint and, if it doesn't return a 200 response, emit an alert for the bad state.
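A minimal sketch of that HTTP GET check, assuming the service exposes a healthcheck URL (the `localhost:8080` address below is a hypothetical placeholder). It treats anything other than a 200, including connection failures and timeouts, as unhealthy; a real monitoring system would run this on a schedule and route the failure to an alert.

```python
import http.client
import urllib.request


def check_http_health(url: str, timeout: float = 5.0) -> bool:
    """Service-level check: True only if a GET to `url` returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # urllib.error.HTTPError (e.g. a 503) and URLError are OSError subclasses,
        # as are connection-refused errors and socket timeouts.
        return False


if __name__ == "__main__":
    # Hypothetical endpoint; substitute your service's healthcheck URL.
    if not check_http_health("http://localhost:8080/healthcheck"):
        print("ALERT: healthcheck failed")
```

Note that `urlopen` follows redirects, so a redirect that ends in a 200 counts as healthy here; whether that is acceptable depends on your service.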
Here are some other basic examples to consider for setting up a healthcheck:
- Check the service documentation for a pre-existing healthcheck endpoint (or build one into your service if it doesn't have one)
- If the service is a web service or exposes any HTTP endpoints, consider using one of those as the target for a healthcheck.
- If the service writes logs to disk or syslog, you can monitor the logs for keywords that indicate a fault, or alert when a log has not been updated within a certain interval.
- If the service sits behind a load balancer, for example an Amazon ELB or a Google NLB, you can monitor the server's responses using the metrics the load balancer provides.
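The log-based approach from the list can be sketched as two small functions: one that flags a log file that is missing or hasn't been modified recently, and one that scans lines for fault keywords. The log path, the 5-minute interval, and the keyword list are hypothetical examples; pick values that match how often your service normally logs.

```python
import os
import time
from typing import Iterable, List, Optional


def log_is_stale(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True if the log at `path` is missing/unreadable or not modified within max_age_seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable log is treated as a fault
    return age > max_age_seconds


def find_fault_lines(lines: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Return the lines that contain any of the fault keywords (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [line for line in lines if any(k in line.lower() for k in lowered)]


if __name__ == "__main__":
    LOG = "/var/log/myservice.log"  # hypothetical log path
    if log_is_stale(LOG, max_age_seconds=300):
        print("ALERT: log not updated in the last 5 minutes")
    else:
        with open(LOG, encoding="utf-8", errors="replace") as f:
            for line in find_fault_lines(f, ["ERROR", "FATAL", "Traceback"]):
                print("ALERT:", line.rstrip())
```

The staleness check catches a hung or crashed process that simply stops logging, which a keyword scan alone would miss.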
In large distributed environments, it is common to collect stats into time series databases such as Graphite or InfluxDB. Your monitoring server then regularly checks specific metrics over a set period for anomalies.
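What counts as an "anomaly" depends on your monitoring system. One simple rule, shown here purely as an illustration, is a z-score test: flag the latest value if it sits more than a few standard deviations from the mean of a recent window. This sketch assumes the window of values has already been fetched from your time series database; it does not call the Graphite or InfluxDB APIs, and the threshold of 3 is a hypothetical default.

```python
import statistics
from typing import Sequence


def is_anomalous(window: Sequence[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than z_threshold standard deviations from the window mean."""
    if len(window) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(window)
    stdev = statistics.stdev(window)  # sample standard deviation
    if stdev == 0:
        return latest != mean  # flat history: any change is treated as unusual
    return abs(latest - mean) / stdev > z_threshold
```

For example, with a recent latency window of `[100.0, 102.0, 98.0, 101.0, 99.0]`, a new reading of `150.0` is flagged while `100.5` is not. Static thresholds are often simpler to reason about; a z-score is one option for metrics whose normal level drifts over time.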
Using ICMP (ping) is not an ideal check, as it is the most basic form of host-level check: it won't report the status of the service itself and should be one of your last options.
Update
I saw that this answer was marked as not answering the original question, which surprised me a bit, so I'll be more direct: don't use ICMP to monitor host-level stats, for the reasons I mentioned above.