
I have a strange error message in the logs; the sequence started like this:

:39:35 host1 kernel: [54674279.243416] mpt2sas0: fault_state(0x2651)!
:39:35 host1 kernel: [54674279.243543] mpt2sas0: sending diag reset !!
:39:36 host1 kernel: [54674280.481215] mpt2sas0: diag reset: SUCCESS
:39:36 host1 kernel: [54674280.713443] mpt2sas0: LSISAS2008: FWVersion(07.15.08.00), ChipRevision(0x03), BiosVersion(07.02.03.00)
:39:36 host1 kernel: [54674280.713451] mpt2sas0: Dell 6Gbps SAS HBA: Vendor(0x1000), Device(0x0072), SSVID(0x1028), SSDID(0x1F1C)
:39:36 host1 kernel: [54674280.713455] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
:39:36 host1 kernel: [54674280.713518] mpt2sas0: sending port enable !!
:39:43 host1 kernel: [54674287.616666] mpt2sas0: port enable: SUCCESS
:39:43 host1 kernel: [54674287.616814] mpt2sas0: search for end-devices: start
:39:43 host1 kernel: [54674287.617657] scsi target7:0:3: handle(0x0009), sas_addr(0x590b11c410294314), enclosure logical id(0x590b11c007729400), slot(7)
:39:43 host1 kernel: [54674287.617735] scsi target7:0:2: handle(0x000a), sas_addr(0x590b11c41025f914), enclosure logical id(0x590b11c007729400), slot(3)
:39:43 host1 kernel: [54674287.617807] mpt2sas0: search for end-devices: complete
:39:43 host1 kernel: [54674287.617810] mpt2sas0: search for raid volumes: start
:39:43 host1 kernel: [54674287.617813] mpt2sas0: search for responding raid volumes: complete
:39:43 host1 kernel: [54674287.617816] mpt2sas0: search for expanders: start
:39:43 host1 kernel: [54674287.617818] mpt2sas0: search for expanders: complete
:39:43 host1 kernel: [54674287.617833] mpt2sas0: search for end-devices: start
:39:43 host1 kernel: [54674287.618468] scsi target7:0:3: handle(0x0009), sas_addr(0x590b11c410294314), enclosure logical id(0x590b11c007729400), slot(7)
:39:43 host1 kernel: [54674287.618543] scsi target7:0:2: handle(0x000a), sas_addr(0x590b11c41025f914), enclosure logical id(0x590b11c007729400), slot(3)
:39:43 host1 kernel: [54674287.618614] mpt2sas0: search for end-devices: complete
:39:43 host1 kernel: [54674287.618617] mpt2sas0: search for raid volumes: start
:39:43 host1 kernel: [54674287.618619] mpt2sas0: search for responding raid volumes: complete
:39:43 host1 kernel: [54674287.618622] mpt2sas0: search for expanders: start
:39:43 host1 kernel: [54674287.618624] mpt2sas0: search for expanders: complete
:39:43 host1 kernel: [54674287.618632] mpt2sas0: _base_fault_reset_work: hard reset: success
:39:43 host1 kernel: [54674287.618639] mpt2sas0: removing unresponding devices: start
:39:43 host1 kernel: [54674287.618642] mpt2sas0: removing unresponding devices: complete
:39:43 host1 kernel: [54674287.618654] mpt2sas0: scan devices: start
:39:43 host1 kernel: [54674287.619530] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
:39:43 host1 kernel: [54674287.619866] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!

and the last message is then repeated many times per second. Other information I consider relevant:

This is a Dell machine running an aged Linux kernel, connected via SAS to a Dell disk array.

# uname -a
Linux host1 3.2.0-34-generic #53-Ubuntu SMP Thu Nov 15 10:48:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

# modinfo -F version mpt2sas 
10.100.00.00

# lspci | grep LSI
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03) 
08:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
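
The mpt2sas messages come from the SAS2008 HBA at 08:00.0 (the log identifies it as the Dell 6Gbps SAS HBA); the MegaRAID controller at 01:00.0 is handled by a different driver. If it helps, the mapping from that PCI address to the SCSI host number seen in the log (host 7) can be checked through sysfs; a quick sanity check, assuming the standard sysfs layout and PCI domain 0000 (the second command reads the HBA's SAS address via an mpt2sas-specific host attribute, assuming it is present on this kernel):

# ls -d /sys/bus/pci/devices/0000:08:00.0/host*
# cat /sys/class/scsi_host/host7/host_sas_address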

With more debugging enabled in mpt2sas, this is the result:

 mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()! 
  phy-7:4: refresh: parent sas_addr(0x590b11c007729400), 
       link_rate(0x08), phy(4) 
       attached_handle(0x0000), sas_addr(0x0000000000000000)
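
(In case it matters: the mpt2sas logging level can be raised at runtime via sysfs on kernels built with CONFIG_SCSI_MPT2SAS_LOGGING, roughly as below. The 0x4000 mask is my reading of the MPT_DEBUG_TRANSPORT bit in mpt2sas_debug.h, so verify it against the source before using it.)

# cat /sys/class/scsi_host/host7/logging_level
# echo 0x4000 > /sys/class/scsi_host/host7/logging_level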

Other machines connected to different volumes of the same disk array work normally. The disk array and iDRAC provide no clues in their logs; everything there looks normal. Googling turned up some horror stories in which such a RAID eventually drops all its disks. The problem does not correlate with unusually high load.

The behaviour continues for hours.

Red Hat seems to have a very similar question, but no solution yet, as far as I can tell:

https://access.redhat.com/solutions/1990653

Unfortunately, I can't reboot the machine to perform experiments.
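
The only experiments I can think of short of a reboot are sysfs-level ones, such as rescanning the host or resetting the affected phy (both untried here; host and phy numbers are taken from the debug output above, and the second command assumes the SAS transport class exposes link_reset for this phy):

# echo "- - -" > /sys/class/scsi_host/host7/scan
# echo 1 > /sys/class/sas_phy/phy-7:4/link_reset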

