Every day I get an email from the logwatch program on our company web server. The server is running CentOS 6 and uses Apache, MySQL and PHP to serve web pages. It is a dedicated piece of hardware (not a VPS) in a data centre in the UK. It has two USB drives attached, which we back up to.

This morning (under the Kernel Begin heading) I saw the following errors:

http://pastebin.com/raw.php?i=W8ZBf5E8

It looks to me like the errors are focused on the first USB drive (/dev/sdc1). My questions are:

  • Are these errors cause for concern?
  • Do these errors indicate that the USB drive may be about to fail?
  • What would be your recommended course of action?

In case it helps someone to diagnose the issue further, here is a list of all hard drives attached to that server:

http://pastebin.com/raw.php?i=FKkLsuah

Any help or advice is gratefully received.

1 Answer

Questions like this are better answered by going through the full, timestamped logs, which give a sense of what happened and of what the summary leaves out, but I'll do my best.

The disk failed to respond in time, which is the source of the "task aborts". It then failed to reply to the task aborts themselves, which resulted in the "target reset". That, at least, succeeded. Things get worse if the target reset fails: a host reset follows, and if that fails as well it can take down the entire server.

The root cause, though, is that the disk didn't respond in time. Assuming you are running with the default 30-second timeout, this means the disk has some problem. It could be a one-off problem that the disk corrected by itself, or it could be an indication of impending failure. It is hard to tell, and how you handle it depends on how important the disk is to you. Either way, you should check that you have backups of the data on the disk and that those backups are in a usable state.
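To confirm which timeout is actually in effect, the kernel exposes it per device in sysfs. The following is a minimal Python sketch, assuming the drive shows up as sdc (as /dev/sdc1 in the question suggests) and that the usual /sys/block/<device>/device/timeout file is present; the function name and the sysfs_root parameter are only illustrative.

```python
from pathlib import Path


def read_scsi_timeout(device, sysfs_root="/sys/block"):
    """Return the SCSI command timeout, in seconds, for a block device such as 'sdc'."""
    # The kernel publishes the per-device command timeout as a plain integer.
    # sysfs_root is a parameter only so the function can be pointed at a test tree.
    timeout_file = Path(sysfs_root) / device / "device" / "timeout"
    return int(timeout_file.read_text().strip())
```

Calling read_scsi_timeout("sdc") on the server should return 30 if the default has not been changed.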

You should look at the disk information with smartctl (assuming it is SATA), and you can try using diskscan to read through the disk and show you a latency graph over the disk surface. If there are too many places where the latency is high (above a few seconds), you should rewrite the disk and/or replace it. diskscan has an option to fix the disk, by which it means that it will rewrite the seemingly bad locations.
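To illustrate what a latency scan measures, here is a rough Python sketch that reads a device sequentially and records the chunks that took longer than a threshold. It is not how diskscan is implemented, only an approximation of the idea: the function name, chunk size and 2-second threshold are assumptions, it does not bypass the OS page cache, and it makes no attempt to handle read errors from a failing disk.

```python
import time


def scan_latency(path, chunk_size=1024 * 1024, slow_threshold=2.0):
    """Read `path` sequentially; return (offset, seconds) for each slow chunk."""
    slow = []
    offset = 0
    # buffering=0 disables Python-level buffering; the OS page cache may still
    # serve repeated reads quickly, so timings are only indicative.
    with open(path, "rb", buffering=0) as dev:
        while True:
            start = time.monotonic()
            data = dev.read(chunk_size)
            elapsed = time.monotonic() - start
            if not data:
                break  # end of device or file
            if elapsed >= slow_threshold:
                slow.append((offset, elapsed))
            offset += len(data)
    return slow
```

Run as root against the whole device, for example scan_latency("/dev/sdc"), it only reads and never writes. Many entries in the returned list would point to the kind of slow regions described above.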

Baruch Even