I've got an HP MicroServer N54L running Linux Mint 17.2 (fresh install). Every few days, I find the machine in a powered-off state (standby, actually - not in the sense that it went to sleep, but in the sense that it is not running, yet still has power and can be started by hitting the power button).

I've run memtest86 on it, and it reported no errors. I can't find much of interest in kern.log, syslog, dmesg, etc., with the exception of:

Aug  1 06:14:16 donbot kernel: [388813.031331] radeon 0000:01:05.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment 
Aug  1 06:14:16 donbot kernel: [388813.031346] pci 0000:00:14.4: PCI bridge to [bus 03]

in kern.log, just before the power loss. And

Aug  1 15:20:35 donbot kernel: [    3.260404] radeon 0000:01:05.0: registered panic notifier

in kern.log, upon rebooting the machine. Prior to installing Mint 17.2, my Mint 16 installation was experiencing the same power cuts, and I tried briefly to get linux-crashdump working, but was unable to get any crash dumps out of it.
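For anyone wanting to locate the same "last line before power loss" entries, here is a minimal sketch of one way to do it. It assumes the syslog-style kern.log format shown above, and it treats any point where the bracketed kernel uptime counter goes backwards as a reboot. The function names are just illustrative, and a clean shutdown resets the counter too, so this finds every reboot, not only the unexpected ones.

```python
import re

# Matches syslog-style kernel lines such as:
#   Aug  1 06:14:16 donbot kernel: [388813.031331] radeon ...
# Group 1 is the wall-clock timestamp, group 2 the kernel uptime in seconds.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+kernel:\s+\[\s*(\d+\.\d+)\]"
)


def find_uptime_resets(lines):
    """Return (last_line_before, first_line_after) pairs for every place
    where the kernel uptime counter goes backwards, i.e. a reboot."""
    resets = []
    prev = None  # (uptime_seconds, raw_line) of the previous kernel line
    for raw in lines:
        match = LINE_RE.match(raw)
        if not match:
            continue  # skip lines that are not kernel messages
        uptime = float(match.group(2))
        line = raw.rstrip("\n")
        if prev is not None and uptime < prev[0]:
            resets.append((prev[1], line))
        prev = (uptime, line)
    return resets


def report(path):
    """Print the log lines on either side of each reboot found in `path`."""
    with open(path, errors="replace") as fh:
        for before, after in find_uptime_resets(fh):
            print("last before reset:", before)
            print("first after reset:", after)
            print()
```

Calling `report("/var/log/kern.log")` (the usual location on Mint; rotated logs would need to be fed in separately) prints the pairs of lines surrounding each reboot, which makes it easier to compare what was logged just before each power loss.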

Sometimes, the power loss seems to happen while my snapraid cron job is running an integrity check on my drives. This is a fairly intensive process, but it never takes more than about 1/3 of the system memory, nor more than one of the two CPUs. I'm pretty sure that some of the crashes have happened at a time when nothing was running. (I just now successfully ran a 6-hour snapraid scrub of all disks without incident. However, I don't recall having this issue prior to scheduling daily snapraid runs via cron.)

The machine is running headless most of the time, so I'm not sure what the radeon driver has to do with it. (There's no graphics card installed, this is presumably the on-board graphics.) I've installed sysstat for more monitoring options.

I believe I witnessed one of these crashes first-hand just now. I was running snapraid in one shell, and attempted to more /var/log/sysstat/sa01 (which I know is a binary file). The system, possibly coincidentally, froze just as I was hitting the return key on the more command.

I'm at a bit of a loss here. It smells like a hardware problem - but as I mentioned, I've run memtest86 and haven't been able to force an error. (The server has ECC memory, btw.)

The machine is plugged into a surge suppressor. None of the other equipment in that closet seems to reset itself. However, I notice that when listening to music from this server (it's directly plugged into an amp), I'll get a brief burst of static every once in a while.

How can I attempt to track this down further?

meeotch
  • Have you considered the possibility that your facility has electrical problems? – Michael Hampton Aug 02 '15 at 00:07
  • I have considered that. As mentioned, though, neither the amplifier nor the video security system that are both in that same closet (amp is on the same circuit as the server, in fact) show signs of having been "tripped", and I never notice flickering lights or anything. If there's some sort of cheap power line monitor that I could plug into that circuit to detect fluctuations, I'd be interested to hear about it. I'm reluctant to hire an electrician for a problem that's so sporadic. – meeotch Aug 02 '15 at 00:13
  • Those brief bursts of static are a big clue that you've probably missed something electrical. I would have an electrician check it out. – Michael Hampton Aug 02 '15 at 00:15
  • ...completing my earlier comment: "I'm reluctant to hire an electrician for a problem that's so sporadic. At least until I can prove that it is indeed an electrical problem." Hence, a cheap diagnostic line monitor would be worth the effort. – meeotch Aug 02 '15 at 00:21
  • @user6081 The question about what equipment to use to monitor the voltage delivered to the unit is a product recommendation question, and thus might be off-topic. Moreover, questions about such equipment might be better suited to a site about electrical engineering, even if not strictly off-topic here. – kasperd Aug 02 '15 at 13:28

0 Answers