1

At my office, we have a server whose RAID controller (an HP Smart Array) we suspect is failing. A cold boot, however, does not report any problem.

Can anyone recommend a method to stress-test the controller?


Symptoms that make me suspect a failing controller:

  • Disk access getting slower, queue getting longer
  • When I run dmesg on the XenServer console, I see many messages similar to this one:

    end_request: I/O error, dev tda, sector 253655584
    

    (the sector number is never the same)

  • When we move the VM to another physical host, we no longer see the above message

  • When the host runs idle (without any running VMs), dmesg no longer emits the above message
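
To quantify the dmesg symptom above, the errors can be tallied per device. The following is only a minimal sketch: the function name `summarize_io_errors` is made up for illustration, and the pattern assumes the kernel message has exactly the shape quoted above (optionally preceded by a timestamp).

```python
import re
from collections import Counter

# Matches kernel lines shaped like the one quoted in the question:
#   end_request: I/O error, dev tda, sector 253655584
# The device group also accepts names containing a slash.
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev ([^,\s]+), sector (\d+)")


def summarize_io_errors(dmesg_text):
    """Return {device: (error_count, distinct_sector_count)}."""
    counts = Counter()
    sectors = {}
    for match in IO_ERROR_RE.finditer(dmesg_text):
        dev, sector = match.group(1), int(match.group(2))
        counts[dev] += 1
        sectors.setdefault(dev, set()).add(sector)
    return {dev: (counts[dev], len(sectors[dev])) for dev in counts}


# The single line quoted in the question, used as sample input.
sample = "end_request: I/O error, dev tda, sector 253655584\n"
print(summarize_io_errors(sample))  # {'tda': (1, 1)}
```

Feeding it the full dmesg output would show whether the errors are confined to one device and whether the sectors really never repeat.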

A search on Google indicated that the above message is most commonly associated with a failing SmartArray controller.

How can I be sure that the SmartArray controller is failing?

pepoluan

2 Answers

4

HP Smart Array controllers don't fail often. When they do, the failure is typically sudden rather than a gradual degradation.

Either way, you can run offline diagnostics on the array by booting the HP SmartStart DVD included with the server and running the HP Array Diagnostics Utility (ADU).

You didn't indicate the model or generation of your server or the RAID controller (those things are helpful), but the linked DVD image should cover most recent HP systems.

For an online stress test, the stress utility works well.
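
As a rough illustration, a disk-focused stress invocation could be assembled like this. It is a sketch, not a recommended load profile: the helper name `build_stress_command` and all worker counts, sizes, and durations are assumptions picked for the example. Only the `--hdd`, `--hdd-bytes`, `--io`, and `--timeout` options of stress are used.

```python
import shlex
import shutil


def build_stress_command(hdd_workers=4, hdd_bytes="1G", io_workers=2, timeout="10m"):
    """Build an argv list for a disk-focused run of the stress utility."""
    return [
        "stress",
        "--hdd", str(hdd_workers),   # workers looping on write()/unlink()
        "--hdd-bytes", hdd_bytes,    # bytes written per hdd worker
        "--io", str(io_workers),     # workers looping on sync()
        "--timeout", timeout,        # stop after this long
    ]


cmd = build_stress_command()
print(shlex.join(cmd))  # stress --hdd 4 --hdd-bytes 1G --io 2 --timeout 10m

# Only report availability here; actually launching the load is left to the operator.
if shutil.which("stress") is None:
    print("stress is not installed on this host")
```

The hdd workers should write their temporary files into the current working directory, so the command would need to be started from a filesystem that lives on the suspect array; watching dmesg during the run shows whether the I/O errors reappear under load.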

ewwhite
2

i have experienced erratic behavior from a RAID array when one drive is failing slowly, but not enough to completely die or cross a counter threshold to indicate failure.

first: i assume you have your RAID set up in some sort of redundant configuration such as RAID 10 or RAID 5? and that you have a hot spare configured (or at least have a spare drive on hand)?

launch the hp array management software and look at the SMART data for each drive. identify any drives that have significantly more errors than the others.
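
one way to make "significantly more errors than the others" concrete is to compare each drive against the median of the set. this is only a sketch: the function name, the `factor` and `floor` thresholds, the bay labels, and the error counts below are all invented for illustration, and copying the real counters out of the management software is left as a manual step.

```python
from statistics import median


def suspect_drives(error_counts, factor=3, floor=10):
    """Return drives whose error count exceeds both `floor` and `factor` times the median.

    The result is ordered from most to fewest errors.
    """
    if not error_counts:
        return []
    typical = median(error_counts.values())
    threshold = max(floor, factor * typical)
    flagged = [drive for drive, count in error_counts.items() if count > threshold]
    return sorted(flagged, key=lambda drive: -error_counts[drive])


# Hypothetical per-drive error counters, NOT real measurements.
counts = {"bay1": 0, "bay2": 2, "bay3": 140, "bay4": 1}
print(suspect_drives(counts))  # ['bay3']
```

the drives it returns, worst first, are the ones to pull first in the next step.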

starting with one of the drives you identified, take out a drive. wait for the hot spare to rebuild if you have one. then test again and see if the situation improves. if it does, then you have found your drive. if not, reinstall the drive and repeat with the next.

also, it has been my experience that upgrading the firmware on the hard drives and the controllers improved the detection of failing drives.

longneck