8

How can a guest inside ESX find I/O problems like this?

```
[ 40.601502] end_request: critical target error, dev sdg, sector 430203456
[ 40.601563] sd 2:0:6:0: [sdg] Unhandled sense code
[ 40.601582] sd 2:0:6:0: [sdg] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
[ 40.601622] sd 2:0:6:0: [sdg] Sense Key : Hardware Error Sense Key : Hardware Error [current] [current]
[ 40.601661] sd 2:0:6:0: [sdg] Add. Sense: Internal target failureAdd. Sense: Internal target failure
[ 40.601695] sd 2:0:6:0: [sdg] CDB: Write(10)Write(10):: 2a 2a 00 00 02 19 64 a4 05 62 c0 80 00 00 00 00 40 40 00 00
```
  • Physically the data is stored on VMFS on a RAID6 array (Adaptec 5805), which seems happy.
  • The ESX host does not log any problems either.
  • The disk size reported by the guest appears to match the provisioned size.
  • Through ESX the guest has 9 identical 'drives' attached, and only 2 exhibit this problem (a per-drive read test is sketched below).
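
A quick way to narrow down which of the attached drives are affected is to read each one end-to-end from inside the guest and watch the kernel log. This is only a sketch, assuming the virtual disks appear as /dev/sda through /dev/sdi (check with `lsblk`) and that a full sequential read is acceptable:

```sh
# Read every virtual disk front to back; direct I/O keeps the page
# cache from hiding errors. Device names are an assumption - adjust
# to match the guest's actual layout.
for dev in /dev/sd[a-i]; do
    echo "=== $dev ==="
    dd if="$dev" of=/dev/null bs=1M iflag=direct 2>&1 | tail -n 2
done

# Any failing reads also show up in the kernel log:
dmesg | grep -iE 'critical target error|i/o error'
```
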
the-wabbit
Tobi Oetiker
  • Maybe a bug in the I/O emulation layer? Have you tried changing the guest's SCSI controller type to see if it changes the behavior? Does accessing the specified sector reproduce the error? Use `dd if=/dev/sdg bs=512 skip=430203455 count=1` for re-reading, or just `badblocks -w -b 512 /dev/sdg 430203457 430203455` to do a read-test-write-rewrite cycle if you are feeling brave. – the-wabbit Jan 23 '12 at 01:16
  • What kernel version do you have there? Upgrade your kernel and see if the error still appears. – Sacx Jan 24 '12 at 21:38
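
Expanding on the first comment's suggestion: a non-destructive way to confirm the error is reproducible is to re-read just the reported sector with direct I/O (so the page cache cannot satisfy the read) and then check dmesg. A sketch, assuming the kernel's sector number 430203456 is the 0-based LBA on a 512-byte-sector device:

```sh
# Re-read the sector named in the error message; a reproducible fault
# shows up both in dd's exit status and in the kernel log.
dd if=/dev/sdg of=/dev/null bs=512 skip=430203456 count=1 iflag=direct
dmesg | tail -n 20
```
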

2 Answers

1

I've experienced a similar thing on a backup volume for MS SQL in a Windows 2008 guest under ESX 4.0 - it's a raw volume exposed from a NetApp filer.

The guest OS reported (and still shows) bad sectors on that volume.
I think this happened because of too many I/O write operations, a temporary timeout, or filer overload.
No new bad sectors have been reported since, NetApp "disk scrubbing" says everything is OK, and the filer logged no errors.

But we are going to recreate this volume anyway and see whether that fixes it.

What about your other volumes on the same storage? Can you check this volume with `badblocks /dev/sdg`? (Caution: it generates a huge amount of read I/O.)
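
To keep the overhead of that check manageable, a read-only badblocks pass (the default mode, so non-destructive) with a larger block size and progress reporting might look roughly like this; the device name and block size are assumptions:

```sh
# Non-destructive, read-only scan of the suspect volume.
#   -b 4096  scan in 4 KiB blocks to speed things up
#   -s       show progress
#   -v       report bad blocks as they are found
badblocks -b 4096 -s -v /dev/sdg
```
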

Frands Hansen
TooMeeK
1

It was a hardware/firmware problem after all. While the Adaptec 5805 (with the latest firmware) was reporting all RAID6 volumes to be in an optimal state, it also reported one volume as containing 'Failed Stripes'. The effect of this seems to be that part of the RAID6 volume becomes unreadable (causing the errors quoted in the question). ESX does not seem to notice this directly, but running `dd if=/dev/zero of=file-on-damaged-volume` directly on the ESXi console ended in an I/O error while there was still plenty of free space on the volume.
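
For reference, the fill test described above can be run roughly like this on the ESXi console (the datastore path and file name are placeholders):

```sh
# Run from the ESXi shell. Writing zeroes into a file on the suspect
# datastore until dd aborts exposes the unreadable region even though
# the controller reports the volume as Optimal.
dd if=/dev/zero of=/vmfs/volumes/datastore1/filltest bs=1M
# dd stops with an I/O error well before the datastore is full when
# the underlying volume contains failed stripes.
rm /vmfs/volumes/datastore1/filltest   # remove the test file afterwards
```
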

No amount of `arcconf verify` / `verify_fix` runs on the volumes and physical devices was able to detect or fix anything ... Eventually I moved all the data off the volume and re-created it at the Adaptec level. Now all is well, but my trust in Adaptec's ability to safeguard my data is severely damaged.
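
For anyone checking their own arrays: the 'Failed Stripes' flag and the verify runs mentioned above can be driven with arcconf roughly as follows (controller number 1 and logical device 0 are assumptions, adjust to your setup):

```sh
# Show logical device status on controller 1, including the
# "Failed Stripes" field that gave the problem away here.
arcconf getconfig 1 ld

# Start a verify-with-fix on logical device 0 - the kind of run that
# in this case still could not repair the failed stripes.
arcconf task start 1 logicaldrive 0 verify_fix
```
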

Tobi Oetiker
    This is pretty coherent with the [Sun/Oracle procedure for such situations](http://docs.oracle.com/cd/E19121-01/sf.x4140/820-2396-18/HardwareFirmwareBIOS.html#50446374_12364). There is also [this Adaptec FAQ article about bad stripes](http://ask.adaptec.com/scripts/adaptec_tic.cfg/php.exe/enduser/std_adp.php?p_faqid=14947) which gives some background information on how bad stripes occur and what can be done to prevent them. – the-wabbit Feb 06 '12 at 12:54
  • Yes, the Sun/Oracle article got me on the right (sad) track. We had a failed disk in this array, but it is RAID6, so even then there was redundancy, and none of the later media checks revealed any errors on the remaining disks ... Also, the Adaptec controller has a BBU, so I don't really see any excuse for this behavior :-( We never had any such problems with our Areca controllers. – Tobi Oetiker Feb 06 '12 at 22:27
  • I hardly ever use Adaptec controllers and mainly maintain LSI storage, but this is the first time I have stumbled upon "bad stripes" too. I wonder if this is something very specific to the Adaptec implementation. – the-wabbit Feb 07 '12 at 10:00