
I recently got an alert from a PE 905 which I manage: I1912 SEL Full. I checked the SEL via the DRAC web UI and saw the following message repeated about 50 times for today:

"The disk drive bay battery has failed"

Followed a few seconds later by the equivalent trouble cleared message (unfortunately I cleared the SEL to see if I was still getting the messages before I could copy down its exact wording).

The trouble is that I wasn't even aware that the drive bay had a battery. (It doesn't, does it?)

The only RAID controller in the box is a PERC 6/i, and its battery is reported as good. I did not see any ROMB errors (nor did I get alerts), nor anything else to indicate the PERC's battery is bad.

Needless to say, I googled the error message, but the best I could find was one cross-posted article in Japanese. Going by Google Translate, the author says that, per Dell, the message could indicate a RAID battery failure or an impending controller failure.

It looks like he replaced both the controller and the battery, thereby resolving the issue. But were both replacements required? (I'm on a tight budget, and no, we no longer have Dell service/support on this machine).

With only one available post on this topic, I'd just like to know if anyone could shed more light on this error. I'd be happy to provide any logs, etc, however everything except that message in the SEL looks hunky-dory. In fact, the error has not returned in the past ~hour since clearing the log.

Thanks!

s.co.tt
  • I've called Dell before on equipment that wasn't under an active support agreement and they've helped point me in the right direction. My bet is that if you call them they'll at the very least tell you what the message means and how to resolve it. You have nothing to lose by giving them a call. – joeqwerty Sep 24 '13 at 05:48
  • Thanks Joe, I gave it a shot but I was asked for a service tag. I gave him the correct one (I don't have any other Dells under contract with a 6/i), and he tried to sell me on support. Unfortunately for me it would be cheaper to just replace the battery. – s.co.tt Sep 24 '13 at 15:38

2 Answers

It looks like the original error message was a precursor to a new message, one which actually does turn up some results in Google. After a quiet night, I started getting the following messages in my system log:

The storage battery has failed.
The storage battery is operating normally.

It's the same pattern as was shown last night, but with a different message.

[Image: ESM Log showing error message]

A Dell Community wiki page reports the detailed description for the error as:

The PERC RAID controller battery may have failed because of thermal exceptions.

Though it's of course possible that this is a localized thermal issue, the system board temperature is currently reported as 26 °C, so it's not a system-wide thermal problem.

A similar issue was reported with a PERC 5/i on one of Dell's mailing lists; that report pointed not to thermal causes but to possibly bad or outdated firmware. (My firmware is up to date.)

In my case, after clearing the SEL again, everything was showing good with the controller's battery and no new events appeared in the log. (Seen via OpenManage).

I initiated a learn cycle on the controller's battery, and almost immediately OM reported the battery as degraded. After that, the log started filling up again with the same messages:

[Image: PERC battery shown as degraded]

Based upon this new information, I'm fairly confident that the problem is the battery. I'll be replacing it later today when I can get to the server's location.

My hypothesis is that a learn cycle started on the battery and it was at that point that the battery began being reported as bad. Perhaps it was heating up as it charged, thereby causing the repetitive messages as it heated and then cooled.

I'm answering my own question because I'm hoping that this helps anyone searching for my original error message (which on a search yielded no English-language results).

Fortunately a bad controller battery isn't an issue for me because the machine in question is connected to a SAN and the PERC is only responsible for a local OS volume which is not write-intensive. However, one thing to be taken away from this is that if you do rely on write caching and have multiple PERC controllers that use the same battery type, keep at least one extra battery on hand.

Update: In the name of science I let the learn cycle on the battery complete. It took a while, but it finished successfully and no new error messages have been added to the ESM Log/SEL.

Of course, the battery is still suspect and will be replaced, but I would recommend that anyone experiencing the symptoms I've described try kicking off a learn cycle.
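
For anyone who prefers the command line to the OpenManage UI, a learn cycle can probably be started with the OMSA CLI as well. This is only a sketch: I used the UI myself, the controller and battery IDs of 0 are assumptions, and `start_learn_cycle` is just a wrapper name made up for illustration. Check your own IDs with `omreport storage controller` first.

```shell
#!/bin/sh
# Sketch: show the battery status, then request a learn cycle via OMSA.
# CONTROLLER and BATTERY default to 0, which is an assumption; override
# them to match the IDs reported on your system.
CONTROLLER="${CONTROLLER:-0}"
BATTERY="${BATTERY:-0}"

start_learn_cycle() {
    # Bail out early if the OMSA command-line tools are not installed.
    if ! command -v omconfig >/dev/null 2>&1; then
        echo "omconfig not found; is OpenManage Server Administrator installed?" >&2
        return 1
    fi
    # Current battery state as the controller reports it.
    omreport storage battery controller="$CONTROLLER"
    # Ask the controller to begin a learn cycle on the battery.
    omconfig storage battery action=startlearn controller="$CONTROLLER" battery="$BATTERY"
}

# Usage: start_learn_cycle
```

Watch the ESM Log/SEL while the cycle runs; in my case the fail/normal messages came back almost immediately after it started.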

s.co.tt

I have seen similar behaviour on a couple of Dell PowerEdge systems where the battery was about five years old.

What I saw was that the virtual disk cache was repeatedly switching from write-back to write-through.

When I called Dell support about this, they told me that it could be a sign of a battery that no longer holds enough charge. There is a state in which the battery is still reported as "OK" in OMSA but its charge level is nonetheless too low. You can check this via the OMSA command line:

omconfig storage controller action=exportlog controller=0

This will create a log file.

On Linux: /var/log/lsi_DDMM.log (day and month). This is an ASCII file (DOS format) in which you will see details about the battery.
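
Because the exported log uses DOS line endings, it can help to strip the carriage returns before searching it. A minimal sketch follows; `battery_lines` is a hypothetical helper name, and matching on the word "battery" is an assumption about how the relevant entries are labelled in the log.

```shell
#!/bin/sh
# Print the battery-related lines from an exported controller log.
# $1 is the path to the log file written by the exportlog action.
battery_lines() {
    # Remove DOS carriage returns, then keep lines mentioning "battery"
    # (case-insensitive).
    tr -d '\r' < "$1" | grep -i 'battery'
}

# Usage (substitute the real file name for the DDMM placeholder):
# battery_lines /var/log/lsi_DDMM.log
```

If nothing matches, grep exits with a non-zero status, so the helper can also be used in a script to test whether the log mentions the battery at all.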

Nils