
I'm having an issue that I am finding really hard to debug. I'm running ZFS, and my system "hiccuped", dumped some information into dmesg, and continued working.

My ZFS pool hosts VMs for ESXi. When this issue occurs, many of the VMs experience block I/O errors, and some of them drop into read-only mode, requiring a restore from backup or an fsck to repair the filesystems. The issue occurs only every few months. I have hammered the system trying to stress it, and it does not seem to be performance-related, so conclusively solving it seems like a pipe dream to me.

First off, some info about my system (CentOS 7, kernel 4.5):

[root@zfs-head ~]# uname -a

Linux zfs-head 4.5.0-1.el7.elrepo.x86_64 #1 SMP Mon Mar 14 10:24:58 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

dmesg entries:

[4331253.022999] sd 2:0:28:0: [sdaa] tag#2 CDB: Read(10) 28 00 10 a8 3d b5 00 00 20 00
[4331253.023006] mpt3sas_cm0:   sas_address(0x5000c500837f31f2), phy(8)
[4331253.023008] mpt3sas_cm0:   enclosure_logical_id(0x50010c60004d41ff),slot(0)
[4331253.023010] mpt3sas_cm0:   enclosure level(0x0003), connector name(     )
[4331253.023013] mpt3sas_cm0:   handle(0x002d), ioc_status(scsi data underrun)(0x0045), smid(222)
[4331253.023016] mpt3sas_cm0:   request_len(131072), underflow(16384), resid(131072)
[4331253.023018] mpt3sas_cm0:   tag(0), transfer_count(0), sc->result(0x00000000)
[4331253.023020] mpt3sas_cm0:   scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[4331253.023023] mpt3sas_cm0:   [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)
[4331253.023030] sd 2:0:28:0: Mode parameters changed
[4331266.475222] sd 2:0:29:0: [sdab] tag#29 CDB: Write(10) 2a 00 09 97 6e c1 00 00 02 00
[4331266.475229] mpt3sas_cm0:   sas_address(0x5000c500837f25c6), phy(9)
[4331266.475232] mpt3sas_cm0:   enclosure_logical_id(0x50010c60004d41ff),slot(1)
[4331266.475234] mpt3sas_cm0:   enclosure level(0x0003), connector name(     )
[4331266.475237] mpt3sas_cm0:   handle(0x002e), ioc_status(scsi data underrun)(0x0045), smid(139)
[4331266.475239] mpt3sas_cm0:   request_len(8192), underflow(1024), resid(8192)
[4331266.475241] mpt3sas_cm0:   tag(0), transfer_count(0), sc->result(0x00000000)
[4331266.475244] mpt3sas_cm0:   scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[4331266.475246] mpt3sas_cm0:   [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)
[4331266.475252] sd 2:0:29:0: Mode parameters changed
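For anyone reading the raw entries, the sense triplet and the CDB bytes can be decoded by hand. Below is a minimal Python sketch of that decoding. The lookup tables hold only the two codes that appear in this log (in the SCSI code tables, sense key 0x06 is UNIT ATTENTION and ASC/ASCQ 0x2a/0x01 is MODE PARAMETERS CHANGED, which matches the "Mode parameters changed" line printed by the sd driver). The function names are made up for illustration and are not part of any driver or tool.

```python
import re

# Lookup tables covering ONLY the values seen in the log above.
SENSE_KEYS = {0x06: "UNIT ATTENTION"}
ASC_ASCQ = {(0x2A, 0x01): "MODE PARAMETERS CHANGED"}

SENSE_RE = re.compile(
    r"\[sense_key,asc,ascq\]: \[0x([0-9a-f]+),0x([0-9a-f]+),0x([0-9a-f]+)\]",
    re.IGNORECASE,
)


def decode_sense(line):
    """Return (sense key name, ASC/ASCQ name) for an mpt3sas sense line, or None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(group, 16) for group in match.groups())
    return (
        SENSE_KEYS.get(key, "unknown"),
        ASC_ASCQ.get((asc, ascq), "unknown"),
    )


def decode_cdb10(hex_bytes):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    Returns (operation, starting LBA, transfer length in blocks).
    """
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    operation = "READ(10)" if cdb[0] == 0x28 else "WRITE(10)"
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return operation, lba, blocks


if __name__ == "__main__":
    print(decode_sense("[sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)"))
    print(decode_cdb10("28 00 10 a8 3d b5 00 00 20 00"))
    print(decode_cdb10("2a 00 09 97 6e c1 00 00 02 00"))
```

The two CDBs decode to transfer lengths of 0x20 (32) and 0x02 (2) blocks. Dividing request_len by those counts gives 131072 / 32 = 4096 and 8192 / 2 = 4096 bytes per block, so both commands are consistent with 4 KiB logical blocks. In both entries resid equals request_len and transfer_count is 0, meaning no data was moved for either command.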

pool status:

[root@zfs-head ~]# zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

    NAME                                             STATE     READ WRITE CKSUM
    storage                                          ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        s1d1                                         ONLINE       0     0     0
        s2d1                                         ONLINE       0     0     0
      mirror-1                                       ONLINE       0     0     0
        s3d1                                         ONLINE       0     0     0
        s4d1                                         ONLINE       0     0     0
      mirror-2                                       ONLINE       0     0     0
        s1d2                                         ONLINE       0     0     0
        s2d2                                         ONLINE       0     0     0
      mirror-3                                       ONLINE       0     0     0
        s3d2                                         ONLINE       0     0     0
        s4d2                                         ONLINE       0     0     0
      mirror-4                                       ONLINE       0     0     0
        s1d3                                         ONLINE       0     0     0
        s2d3                                         ONLINE       0     0     0
      mirror-5                                       ONLINE       0     0     0
        s3d3                                         ONLINE       0     0     0
        s4d3                                         ONLINE       0     0     0
    logs
      ata-Samsung_SSD_850_PRO_128GB_S24ZNXAGA10768M  ONLINE       0     0     0
    cache
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAG918721R  ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAGA59337A  ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAGA69590F  ONLINE       0     0     0

errors: No known data errors
[root@zfs-head ~]# 

My vdev map:

[root@zfs-head ~]# cat /etc/zfs/vdev_id.conf
#     by-vdev
#     name     fully qualified or base name of device link
alias s1d1       /dev/disk/by-id/scsi-35000c500837ff247
alias s1d2       /dev/disk/by-id/scsi-35000c500837f15c3
alias s1d3       /dev/disk/by-id/scsi-35000c500837f137f
alias s2d1       /dev/disk/by-id/scsi-35000c500837f377b
alias s2d2       /dev/disk/by-id/scsi-35000c500837f5bf7
alias s2d3       /dev/disk/by-id/scsi-35000c500837f75bf
alias s3d1       /dev/disk/by-id/scsi-35000c500837f14d3
alias s3d2       /dev/disk/by-id/scsi-35000c500837f571b
alias s3d3       /dev/disk/by-id/scsi-35000c500837f604f
alias s4d1       /dev/disk/by-id/scsi-35000c500837f31f3
alias s4d2       /dev/disk/by-id/scsi-35000c500837f25c7
alias s4d3       /dev/disk/by-id/scsi-35000c500837f14cf

[root@zfs-head ~]# 
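The sas_address values in the dmesg entries do not appear verbatim in this map, but each differs from one of the by-id names only in its last hex digit. The Python sketch below matches them up under a loudly stated assumption: that the driver reports the drive's SAS port address, that the by-id name carries the logical-unit WWN, and that the two differ only in the low two bits. I have not confirmed that for these drives. The leading "3" of each scsi-3... name is dropped, and the helper name is hypothetical.

```python
# WWNs copied from /etc/zfs/vdev_id.conf above, without the "scsi-3" prefix.
VDEV_ALIASES = {
    "s1d1": "5000c500837ff247",
    "s1d2": "5000c500837f15c3",
    "s1d3": "5000c500837f137f",
    "s2d1": "5000c500837f377b",
    "s2d2": "5000c500837f5bf7",
    "s2d3": "5000c500837f75bf",
    "s3d1": "5000c500837f14d3",
    "s3d2": "5000c500837f571b",
    "s3d3": "5000c500837f604f",
    "s4d1": "5000c500837f31f3",
    "s4d2": "5000c500837f25c7",
    "s4d3": "5000c500837f14cf",
}


def alias_for_sas_address(sas_address, aliases=VDEV_ALIASES):
    """Map an mpt3sas sas_address (hex string) to a vdev alias, or None.

    ASSUMPTION: port address and logical-unit WWN differ only in the
    low two bits, so both are compared with those bits masked off.
    """
    base = int(sas_address, 16) & ~0x3
    for alias, wwn in aliases.items():
        if int(wwn, 16) & ~0x3 == base:
            return alias
    return None


if __name__ == "__main__":
    for address in ("0x5000c500837f31f2", "0x5000c500837f25c6"):
        print(address, "->", alias_for_sas_address(address))
```

If that assumption holds, the two entries point at s4d1 (in mirror-1) and s4d2 (in mirror-3), which is consistent with the log showing two different disks, sdaa and sdab, in enclosure slots 0 and 1.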

The box didn't restart, or really even acknowledge that there was an issue, apart from the dmesg entries. I have searched for those entries as best I can, but did not find anything relevant.

Help appreciated!

user1955162