HPC OSS Node issue with unreadable local hdd error

Question

We have an HPC setup with four OSS servers (OSS1 to OSS4) and two MDS nodes (MDS1 and MDS2). It ran without any problems until yesterday. This morning I found that OSS4 had shut down. I checked the OSS3 logs and found that fencing had been triggered. I have powered OSS4 back on and it is running now.

In the OSS4 logs I saw some "unreadable" errors, shown below:

Feb 26 04:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors 
Feb 26 04:54:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors 
Feb 26 05:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors 
Feb 26 05:54:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors 
Feb 26 06:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors 
Feb 26 06:54:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors 
Feb 26 07:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors

/dev/sda is a local hard disk. Could the node fencing be caused by this error? Will running e2fsck resolve the issue?

I have attached /var/log/messages from OSS3 and OSS4. Could anybody please analyse the log files and advise me on what to do?

score 1 · Answer 1 · answered Feb 27 '12 at 12:18

That disk is broken. Hopefully it's in a RAID1 pair. Pull out the broken one, put in a new one, let it resync.
Send the bust one back to the manufacturer for RMA.

Hopefully your system has monitoring that will have already alerted the vendor to the problem, and they might even have already shipped you a new disk.

Either way, it's shagged. Replace it.
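
If you don't have monitoring in place yet, here is a rough Python sketch of the kind of check it would do: scan syslog lines for the smartd "Currently unreadable (pending) sectors" warning and report the latest count per host and device. The regex is based only on the log lines quoted in the question, so treat the exact format as an assumption; the function name pending_sectors is made up for this example.

```python
import re

# Pattern derived from the smartd lines quoted in the question, e.g.
# "Feb 26 04:24:43 oss4 smartd[9306]: Device: /dev/sda, 2 Currently unreadable (pending) sectors"
# Assumption: other smartd versions or syslog formats may differ.
PENDING_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+smartd\[\d+\]:\s+"
    r"Device:\s+(?P<dev>[^\s,]+),\s+(?P<count>\d+)\s+"
    r"Currently unreadable \(pending\) sectors"
)


def pending_sectors(lines):
    """Return {(host, device): (latest_count, number_of_reports)}.

    Hypothetical helper: it only recognises the pending-sector warning,
    not any other SMART attribute or failure message.
    """
    seen = {}
    for line in lines:
        match = PENDING_RE.match(line.strip())
        if match is None:
            continue  # not a pending-sector warning
        key = (match.group("host"), match.group("dev"))
        reports = seen.get(key, (0, 0))[1] + 1
        # Later lines overwrite earlier counts, so the last report wins.
        seen[key] = (int(match.group("count")), reports)
    return seen
```

You would feed it the open log file, for example pending_sectors(open("/var/log/messages")), and raise an alert for any device with a non-zero count.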
