How to diagnose false positives with a smart PDU pinging devices on LAN

Question

I've got a smart PDU (Pakedge) which pings various LAN devices in the rack. If they don't respond, it power cycles the associated power outlet to reboot the device, and it also sends me an email alert.

What safeguards can I put in place to stop false positives, for instance where the network switch is powered down and every device then "looks offline" / frozen to the PDU?

UPDATE: Some answers have correctly mentioned that force power cycling devices in the above manner could cause issues. I should clarify that the devices I'm doing this to in this instance are rack-mounted AV matrixes / amplifiers rather than servers / NAS.

This isn't something I would be doing. Letting the PDU semi-indiscriminately power cycle outlets isn't a "smart" way of power cycling potentially hung devices and it certainly is putting those devices at risk if they run operating systems that need to be "gracefully" shut down. Doing a hard power off on some devices may render them inoperable and may also lead to data corruption and/or loss. If it were me, I'd stop doing this immediately. — joeqwerty, Aug 10 '18 at 19:19
@joeqwerty oversight on my behalf not properly specifying what equipment was being power cycled, please see above updated question. — sam, Aug 12 '18 at 10:20

score 2 · Answer 1 · answered Aug 11 '18 at 18:16

Do not do this. You have already identified a potential problem if a switch goes down and the PDU power cycles other devices. And power cycling has risks for the integrity of some systems that should be shut down gracefully.

Instead, design the high availability that you need.

Define your uptime requirements.
Monitor the services these devices support from the end user's perspective. For a web server, for example, GET the login page and track every HTTP status code.
When the service availability is inadequate, find the root cause of the outage.
Where a single component failed, you can start adding redundancy. Hot spare routers, load balancers, clusters, and so on.
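As a rough illustration of the monitoring step above, here is a minimal Python sketch that probes a URL and turns a series of status codes into an availability figure. The function names, the timeout, and the choice to count 2xx/3xx as "up" are my own assumptions for the example, not something taken from a particular monitoring product.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET of `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None


def availability(statuses):
    """Fraction of probes that returned a 2xx or 3xx status (assumed 'up')."""
    if not statuses:
        return 0.0
    ok = sum(1 for s in statuses if s is not None and 200 <= s < 400)
    return ok / len(statuses)
```

You would call fetch_status on a schedule, store the results, and compare availability(...) against the uptime requirement you defined. Gear without an HTTP interface would need a different probe, but the idea of recording outcomes over time is the same.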

There are cluster implementations that "shoot nodes in the head" by power cycling them. Corosync + Pacemaker, aka Red Hat cluster suite, can do this. But they have an idea of quorum, and only do so when most of the nodes agree it is dead. And a good cluster implementation requires testing to be sure it reliably fails over, and only when necessary.
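To make the quorum idea concrete, here is a toy Python sketch of a majority vote before power cycling a node. This is only an illustration of the principle; it is not how Corosync or Pacemaker implement it, and the function and parameter names are hypothetical.

```python
def should_fence(votes, total_nodes):
    """Decide whether to power cycle a target node.

    votes: dict mapping a voting node's name to True if that node
           believes the target is dead. Nodes that did not report
           are simply absent and therefore do not count as "dead" votes.
    total_nodes: number of nodes entitled to vote.
    """
    needed = total_nodes // 2 + 1  # strict majority
    dead_votes = sum(1 for believes_dead in votes.values() if believes_dead)
    return dead_votes >= needed
```

The point is that a single observer (such as one PDU sending pings) never gets to decide alone: with three voters, should_fence({"a": True}, 3) stays False until a second voter agrees.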

oversight on my behalf not properly specifying what equipment was being power cycled, please see above updated question. — sam, Aug 12 '18 at 10:20
I still don't think automatic power cycling of equipment based on pings is a good idea; it is inelegant at best. Pings get dropped on overloaded networks or by overzealous firewalls, through no fault of the device. Be more specific (in your research and your question) about what failure modes you want to reset, and what you do not. — John Mahowald, Aug 13 '18 at 02:06
You also have not defined what the service level of this gear is, how long you have to fix it and how bad downtime is. Nor if you see actual problems with unresponsive gear. Or if your requirements and budget are for TV broadcast level redundancy or much less than that. — John Mahowald, Aug 20 '18 at 15:02

score 0 · Answer 2 · answered Aug 12 '18 at 11:11

Is the network switch manageable?

If yes, you can consider this solution.

Step 1: ping the switch. If it responds, continue with the other checks. Otherwise do nothing, and start again from step 1.

If the switch cannot be pinged at all (for example, it is unmanaged and has no IP address), you can instead ping the mail server or another host on the network.

Beware that there are pros and cons to adding additional checks. Under certain conditions you risk not power cycling the LAN devices when you should.
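A minimal Python sketch of the check described in this answer might look like the following. It assumes the logic runs on a separate monitoring host (I don't know whether the PDU itself can express such a dependency), and the ping flags assume Linux iputils. The function names are made up for the example.

```python
import subprocess


def ping(host, timeout_s=2):
    """Send one ICMP echo; True if a reply arrived.

    Assumes the Linux iputils `ping` (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def devices_to_power_cycle(switch, devices, ping_fn=ping):
    """Return the devices that look down, but only if the switch is up.

    If the switch (or whichever reference host is used) does not answer,
    nothing is returned, because every device would look offline anyway.
    """
    if not ping_fn(switch):
        return []
    return [d for d in devices if not ping_fn(d)]
```

Passing ping_fn as a parameter makes it easy to try the logic with a fake ping before letting it anywhere near real outlets.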
