
I tried to get more information from the error log of a SAS disk by running the following command, which prints the values and descriptions of the SAS (SSP) Protocol Specific log page.

# smartctl -d scsi -l sasphy /dev/sg1
Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 79
  number of phys = 1
  phy identifier = 0
    attached device type: end device
    attached reason: power on
    reason: loss of dword synchronization   <======================== (?)
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000...
    attached SAS address = 0x5b8...
    attached phy identifier = 6
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 24194   <======================== (?)
    Phy reset problem = 0
...
relative target port id = 2
  generation code = 79
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 1.5 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    attached SAS address = 0x0
    attached phy identifier = 0
...

From the above, I note a high count for loss of DWORD synchronization, which, according to IBM, is an error that occurs when a PHY stops detecting an incoming stream of DWORDs. I searched for further information about this error but could not find any.

How does loss of DWORD synchronization affect the health of a SAS disk? Do I need to worry about it? And at what threshold should I be monitoring it?

Question Overflow
  • I believe it would affect performance (though I'm not sure; I could be wrong). See page 157 of http://www.seagate.com/staticfiles/support/disc/manuals/Interface%20manuals/100293071c.pdf, which says: "While dword synchronization is lost, the data stream received is invalid and dwords are not passed to the link layer." Are you having performance issues? Did you try contacting the vendor? – vijay rajah Jul 24 '14 at 06:13
  • @vijayrajah, thanks for the info. I don't have issues with my disks right now. The number doesn't seem to be moving up at the moment but I am not sure if this could be a potential problem. – Question Overflow Jul 25 '14 at 07:24
  • Are you using a RAID controller? Are you using any form of software RAID? – ewwhite Aug 19 '14 at 18:04
  • @ewwhite, I am using RAID1 with a Dell PERC H200 controller. – Question Overflow Aug 20 '14 at 03:29

1 Answer


This error doesn't affect the health of the drive itself. If you move the drive to another chassis that doesn't have the link problem, the drive will be fine, assuming the link problems do not originate from the drive's own port.

These errors mean that there is a problem in the link between the drive and the upstream port. If there is a cable in that path, the cable may be bad; if there is no cable, one of the ports is bad. Of course, even with a cable in the path, one of the ports may still be the culprit.

To diagnose it, put a different disk in the same slot and see whether the error goes away. If it goes away, the original disk is bad. If it persists, the original disk is fine but the port on the server/chassis is bad, and the server/chassis needs to be replaced.

The issue with loss of dword synchronization is that it causes retries for some of the IOs sent, and those retransmissions increase IO latency. In severe cases, error recovery may send task aborts and even target resets, which can make the drive inaccessible for many seconds and may cause filesystems to fail or a RAID to drop the disk.
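Because these counters are cumulative, what matters is whether they change over time, not their absolute value. As a rough sketch (not an official tool), the Python below parses text in the format shown in the question and reports which counters moved between two samples. The function names `read_sasphy`, `parse_sasphy` and `counter_deltas` are made up for illustration, and the parser assumes the counter labels appear exactly as in the output quoted above; other smartctl versions may print them differently.

```python
import re
import subprocess

# Counter labels exactly as they appear in the output quoted in the question.
# Assumption: other smartctl versions may word these differently.
COUNTERS = (
    "Invalid DWORD count",
    "Running disparity error count",
    "Loss of DWORD synchronization",
    "Phy reset problem",
)


def read_sasphy(device):
    """Run the same command as in the question and return its text output."""
    result = subprocess.run(
        ["smartctl", "-d", "scsi", "-l", "sasphy", device],
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_sasphy(text):
    """Return {phy identifier: {counter label: value}} for one sample."""
    phys = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # "phy identifier = N" starts a new phy section; the line
        # "attached phy identifier = N" does not match at line start.
        m = re.match(r"phy identifier = (\d+)$", line)
        if m:
            current = phys.setdefault(int(m.group(1)), {})
            continue
        m = re.match(r"(.+?) = (\d+)\b", line)
        if m and current is not None and m.group(1) in COUNTERS:
            current[m.group(1)] = int(m.group(2))
    return phys


def counter_deltas(before, after):
    """Return {(phy, counter label): increase} for counters that changed."""
    deltas = {}
    for phy, counters in after.items():
        for name, value in counters.items():
            old = before.get(phy, {}).get(name)
            if old is not None and value != old:
                deltas[(phy, name)] = value - old
    return deltas
```

One way to use this is to save `parse_sasphy(read_sasphy("/dev/sg1"))` periodically and pass two consecutive samples to `counter_deltas`. An empty result while the system is up and running means the link has not logged new errors since the previous sample.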

Baruch Even
  • I notice that the loss of DWORD synchronisation count goes up whenever I reboot the server. As both my RAID1 disks are having the same problem, does it mean that the server chassis is the source of the problem? And what monitoring level do you propose? As the server is running fine so far, if I want to ask for replacement from my dedicated server provider, I would need to be able to justify my case. It would be helpful if you can provide some references. Thanks. – Question Overflow Aug 20 '14 at 03:41
  • There is no real way to know if the issue is in the chassis or otherwise without switching hardware around. If you only see the counters going up around startup that's less of an issue. If it goes up while the system is up and running it's more of an issue. – Baruch Even Aug 21 '14 at 17:02