
We have a DL180 G6 Server with a P410 RAID Card. The server has the following three RAID arrays.

4x2TB - RAID 10

4x2TB - RAID 10

2x2TB - RAID 1

2x2TB HDDs are configured as hot spares shared by all three arrays.

Following is the relevant output from ESXCLI
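(For reference, output like this can typically be pulled through the hpssacli plugin that ships with the HP ESXi bundle; the exact command below is an assumption and the namespace and quoting may differ by bundle version.)

# Dump the controller configuration via the HP hpssacli esxcli plugin (assumes the HP bundle is installed)
esxcli hpssacli cmd -q "controller slot=1 show config detail"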

Smart Array P410 in Slot 1

Bus Interface: PCI
Slot: 1
Serial Number: PACCR9VYJKGQ
Cache Serial Number: PAAVP9VYJCYN
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Hardware Revision: C
Firmware Version: 2.72
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Surface Scan Mode: Idle
Parallel Surface Scan Supported: No
Queue Depth: Automatic
Monitor and Performance Delay: 60  min
Elevator Sort: Enabled
Degraded Performance Optimization: Disabled
Inconsistency Repair Policy: Disabled
Wait for Cache Room: Disabled
Surface Analysis Inconsistency Notification: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Cache Ratio: 25% Read / 75% Write
Drive Write Cache: Disabled
Total Cache Size: 512 MB
Total Cache Memory Available: 400 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True
Number of Ports: 2 Internal only
Driver Name: HP HPSA
Driver Version: 6.0.0
PCI Address (Domain:Bus:Device.Function): 0000:06:00.0
Host Serial Number: USE626N2XD
Sanitize Erase Supported: False
Primary Boot Volume: None
Secondary Boot Volume: None

array A (SATA, Unused Space: 0 MB)

  logicaldrive 1 (3.6 TB, RAID 1+0, OK)

  physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SATA, 2 TB, OK)
  physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SATA, 2 TB, OK)
  physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SATA, 2 TB, OK)
  physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SATA, 2 TB, OK)
  physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 2 TB, OK, spare)
  physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 2 TB, OK, spare)

array B (SATA, Unused Space: 0 MB)

  logicaldrive 2 (3.6 TB, RAID 1+0, OK)

  physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 2 TB, OK)
  physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 2 TB, OK)
  physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 2 TB, OK)
  physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 2 TB, OK)
  physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 2 TB, OK, spare)
  physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 2 TB, OK, spare)

array C (SATA, Unused Space: 0 MB)

  logicaldrive 3 (1.8 TB, RAID 1, OK)

  physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA, 2 TB, OK)
  physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA, 2 TB, OK)
  physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 2 TB, OK, spare)
  physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 2 TB, OK, spare)

Now, in ESXi, we are getting the following error from time to time.

Lost access to volume 5456cb3e-4fbdb59c-a37a-d8d385644ec0 (datastore2) due to connectivity issues. Recovery attempt is in progress

Keep in mind that this affects all three arrays at exactly the same time, and within a few seconds all three arrays recover. As per our understanding, all of the drives are attached to a single port on the P410 RAID card. Do you think that using both ports could improve performance and potentially eliminate this recurring issue?
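To confirm how often these events occur and that they hit all three datastores at the same timestamps, one option is to grep the host logs (a sketch; the paths are the standard ESXi 5.x log locations and the exact message text can vary between logs):

# Look for the datastore-loss / heartbeat events around the reported times
grep -i "lost access to volume" /var/log/hostd.log
grep -i "heartbeat" /var/log/vobd.log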

We have tried all software solutions at this point, including updating the controller firmware (updated to 6.64). What other options are there?

Update 1

The two spare drives were configured as spares for all three arrays, as described above. I removed the spares from all of the arrays for about 15 minutes and the errors stopped. Now I have configured the first spare for the first array only and the second spare for the second array only, to see whether the error appears again.
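For reference, the spare changes described above can be made from the ESXi shell with the hpssacli utility (a sketch, assuming the hpssacli plugin from the HP bundle is installed; the slot number and drive IDs are taken from the configuration output above):

# Remove both shared spares from array A (repeat for arrays B and C)
esxcli hpssacli cmd -q "controller slot=1 array A remove spares 1I:1:5,1I:1:6"
# Later, dedicate one spare to each of the first two arrays
esxcli hpssacli cmd -q "controller slot=1 array A add spares 1I:1:5"
esxcli hpssacli cmd -q "controller slot=1 array B add spares 1I:1:6"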

Update 2

Reattaching the spares caused the error to return, and it is affecting all three arrays. So I am removing the spares one by one to troubleshoot this further. This is probably the known issue described here: http://community.hpe.com/t5/ProLiant-Servers-ML-DL-SL/ESXi5x-HPSA-P410i-WARNING-LinScsi-SCSILinuxAbortCommands-1843/td-p/6818369. Fingers crossed.
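The thread linked above describes "WARNING: LinScsi: SCSILinuxAbortCommands" messages; a quick way to check whether this host is logging the same thing (standard ESXi log path) is:

# Check the VMkernel log for the abort warnings described in the HPE thread
grep -i "SCSILinuxAbortCommands" /var/log/vmkernel.log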

Nasoo

2 Answers


The two updates posted in the question, together with further troubleshooting, led us to the real answer to the problem. It turned out to be related to the ESXi driver for the P410 RAID card (HPSA). We downgraded to version .60 of the driver, available from http://h20564.www2.hpe.com/hpsc/swd/public/detail?swItemId=MTX_d18033ac346f468c92062ce127, and the problem was resolved.

Keep in mind that none of the recent drivers work, including versions .114, .116, and the recently released .118. So this is the only software solution to the issue, unless your problem is hardware-related as described by user @ewwhite.

Also note that this issue only occurs if you are using spare drives with a P410 card in a DL180 G6 server. I have also seen posts indicating that it occurs with other HP servers, so you might try the .60 version of the driver on those servers to see if it fixes your problem.
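Before downgrading, it is worth checking which hpsa driver version is currently installed and which driver your storage adapter is bound to (standard esxcli commands):

# List the installed hpsa driver VIB and its version
esxcli software vib list | grep -i hpsa
# Show which driver each storage adapter is using
esxcli storage core adapter list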

While facing this issue, you might also see periodic spikes in disk latency without any corresponding read/write load on your server; this is best explained by the following picture:

Periodic Latency Spikes

In the above picture, the red dots denote the periodic spikes while the spare was attached. The green dots denote the period in which the spare had been removed.

As you can see in the above picture, the latency spikes were not associated with any corresponding read/write load and were periodic; in our case they occurred exactly five minutes apart. As soon as the spare was removed, the spikes stopped.
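If you want to capture these spikes yourself, one rough approach (a sketch; it assumes shell access to the host) is to record esxtop in batch mode and look at the device latency counters (DAVG/cmd, KAVG/cmd, GAVG/cmd) for periodic jumps with no matching I/O load:

# Sample every 10 seconds for 180 iterations (about 30 minutes) and save to a CSV for review
esxtop -b -d 10 -n 180 > /tmp/esxtop-latency.csv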

To downgrade to the .60 version of the driver, gracefully shut down the VMs, put your host into maintenance mode, and issue the following commands:

cd /tmp
# Download the 5.5.0.60 HPSA driver VIB
wget http://ftp.hp.com/pub/softlib2/software1/pubsw-linux/p964549618/v97400/scsi-hpsa-5.5.0.60-1OEM.550.0.0.1331820.x86_64.vib
# Install the downloaded VIB
esxcli software vib install -v /tmp/scsi-hpsa-5.5.0.60-1OEM.550.0.0.1331820.x86_64.vib

After that, reboot your server. I hope this helps someone. I will update this answer when HP releases a stable version of the HPSA driver for the P410 that does not cause this issue with spare drives.
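After the reboot, you can confirm that the downgraded driver is the one installed (standard esxcli command; the VIB name matches the file installed above):

# Show the installed scsi-hpsa VIB and confirm its version is 5.5.0.60
esxcli software vib get -n scsi-hpsa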

Nasoo
  • Interesting resolution. But the reason this is a problem with the HPSA driver and your particular server model is the backplane design. It's definitely an edge case. Also, can you confirm whether you were running the HP-specific version of ESXi or not? – ewwhite May 09 '16 at 12:55
  • Yes, we were using the HP-specific ESXi version. More information about this is available here: https://communities.vmware.com/thread/492822?start=0&tstart=0 – Nasoo May 09 '16 at 14:35
  • Had the same issue on a DL180 G6 using the HP Custom ESXi 5.5 image. The only difference is that I downloaded the file directly and uploaded it to my datastore, then installed it with this command: `esxcli software vib install -v /vmfs/volumes/datastoreUUID/temp2/scsi-hpsa-5.5.0.60-1OEM.550.0.0.1331820.x86_64.vib` – Jason Oct 26 '16 at 19:12

This is probably a backplane or backplane-expander issue. There's a slight chance it could be the cable, and possibly the RAID controller.

The DL180 G6 you're using is probably a 12-bay 3.5" unit, and is connected to the Smart Array P410 via a single 4-lane SAS SFF-8087 cable.

Upgrading the firmware was the first thing you should have done. Have you had the same problem since updating the controller firmware? You may also want to update the disks' firmware for good measure.
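For example, the current drive firmware revisions can be checked through the controller utility (a sketch, assuming the hpssacli plugin from the HP bundle is installed on the host):

# Show the model and firmware revision of every physical drive on the controller
esxcli hpssacli cmd -q "controller slot=1 physicaldrive all show detail"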

But given that the design of this server depends entirely on the SAS backplane, and that all disks are impacted at the same time, you're looking at a connection issue that will likely require service or replacement.

ewwhite
  • We are just going to reboot now after updating the firmware, so the issue did not start after the firmware update. All of these are internal drives, and this is a remote server in a data center. The way forward therefore seems to be: a. reboot and test; b. replace the cable and test; c. replace the backplane and test; d. replace the RAID card and test. – Nasoo May 07 '16 at 17:56
  • Yes. That is correct. – ewwhite May 07 '16 at 17:57
  • Thanks, will start on this and will update this post to reflect the progress. – Nasoo May 07 '16 at 18:18