1

I use check_mk_agent for monitoring a server with IPMI and the freeipmi-tools installed. As far as I can see, the monitoring randomly detects no value returned by the IPMI Sensor "Temperature_PCH_Temp".

That's a problem because it results in a CRITICAL state, which triggers a notification. The interruption lasts for only one check; the following check is always OK. The temperature is nowhere near a limit, and neither the readings before the failure nor those after it show a temperature tending toward a threshold.

Does anyone have an idea what could cause this behaviour and how to prevent it?

voretaq7

4 Answers

1

Version 01.78 of the Supermicro IPMI for my X9DRD-iF. You can download it at http://www.supermicro.com/about/policies/disclaimer.cfm?url=/support/resources/getfile.aspx?ID=1940

0

Sounds like a hardware fault (flaky IPMI board, bad sensor) -- You should contact your hardware vendor and report the problem to see if you can get a replacement.

voretaq7
  • I informed the responsible contact person that will handle the replacement in case you're right. Let's see what happens... – Julian Kessel Nov 16 '12 at 19:37
  • @JulianKessel In case I'm wrong, they can probably also get you to a support person who can give you a more definite answer :-) – voretaq7 Nov 16 '12 at 20:02
0

The FreeIPMI ipmi-sensors/ipmimonitoring tools report N/A when a sensor does not return a reading. Although rare (and as voretaq7 says, it's likely a busted sensor), it's not unreasonable for an IPMI sensor to simply say "I don't have a reading for you right now."

I can't speak to what is in the check_mk_agent script; it's possible that it treats "N/A" as critical and reports it back that way.

It's also possible the remote system (if busted) is returning illegal values back to you, which could lead to a "CRITICAL" state when --output-sensor-state is used.

You may want to check whether the --ignore-not-available-sensors or --ignore-unrecognized-events options help you in this situation.
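To illustrate the point about "N/A", here is a minimal Python sketch of a check that treats a missing reading as UNKNOWN instead of CRITICAL. This is not the check_mk_agent code: the function name `classify`, the thresholds, and the pipe-delimited column layout (ID, name, type, reading, units, event) are assumptions made for the example, so compare them against what your FreeIPMI version actually prints.

```python
OK, WARNING, CRITICAL, UNKNOWN = "OK", "WARNING", "CRITICAL", "UNKNOWN"


def classify(line, warn=80.0, crit=90.0):
    """Map one pipe-delimited sensor line to (name, state).

    Hypothetical helper. Assumed columns: ID | Name | Type | Reading | ...
    The warn/crit defaults are made-up values for illustration.
    """
    fields = [f.strip() for f in line.split("|")]
    # Skip the header row and anything that is not a sensor record.
    if len(fields) < 4 or not fields[0].isdigit():
        return None
    name, reading = fields[1], fields[3]
    try:
        value = float(reading)
    except ValueError:
        # "N/A" (or any other non-numeric reading) means "no data right now",
        # which is not the same thing as a threshold violation.
        return name, UNKNOWN
    if value >= crit:
        return name, CRITICAL
    if value >= warn:
        return name, WARNING
    return name, OK


if __name__ == "__main__":
    # Illustrative sample lines, not captured from a real system.
    sample = [
        "ID | Name | Type | Reading | Units | Event",
        "4 | PCH Temp | Temperature | 45.00 | C | 'OK'",
        "4 | PCH Temp | Temperature | N/A | C | N/A",
    ]
    for row in sample:
        print(classify(row))
```

Whether UNKNOWN should notify anyone is then a policy decision in the monitoring core rather than something the sensor parser decides.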

Albert Chu
0

You have configured retries for the check, so that it doesn't alert you just because of a short hiccup, right?
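The idea behind retries can be sketched in a few lines of Python: only notify once the check has failed several times in a row, which is similar in spirit to the `max_check_attempts` setting in Nagios. The function name `should_notify` and the default of three attempts are assumptions for this example, not check_mk's own implementation.

```python
def should_notify(results, max_attempts=3):
    """Return True only if the last `max_attempts` results are all non-OK.

    Hypothetical helper: `results` is a list of state strings, oldest first.
    """
    if len(results) < max_attempts:
        return False  # not enough history to call it a hard failure yet
    return all(state != "OK" for state in results[-max_attempts:])


if __name__ == "__main__":
    # A single failed check between two good ones stays silent.
    print(should_notify(["OK", "CRITICAL", "OK"]))
    # Three failures in a row trigger the notification.
    print(should_notify(["CRITICAL", "CRITICAL", "CRITICAL"]))
```

With a rule like this, the one-check dropout described in the question would never reach the notification stage.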

btw, I think Albert Chu is correct about N/A being handled incorrectly. It's probably only evaluated at first inventory of the system; there's a mail with relevant patches by a user named Bernhard Schmidt on the check_mk mailing lists.

But, as this thread proves, such problems are basically always just related to hardware issues anyway :)

Florian Heigl