Bad motherboard / controller / HDs?

Question

On a leased server, I am running into timing issues with an application that requires precise timing. The server is a dual Xeon E5410 on a Supermicro X7DVL-3 motherboard, running CentOS 5.5 x64.

The application is timer-sensitive and keeps detecting drift both at idle and under load, but especially under load. I did some investigating with atop and dd and found some mind-blowing numbers. Mind you, I am no Linux guru, but something sure seems out of whack.

I ran:

dd bs=4096 if=/dev/zero of=/bigtestfile

to generate disk activity. Regardless of whether I wrote to sda or sdb, the DSK busy value in atop went over 100%, at one point peaking at 1700%.

DSK |         sdb | busy    675% | read       0 | write    110 | avio   78 ms |

Here are the smartctl outputs:

# smartctl -A /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   165   165   021    Pre-fail  Always       -       2750
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       25831
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
194 Temperature_Celsius     0x0022   116   093   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0


# smartctl -A /dev/sdb
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   180   180   021    Pre-fail  Always       -       3958
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       24087
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
194 Temperature_Celsius     0x0022   122   096   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

Any idea what's wrong here? Bad motherboard? It seems unlikely that both drives are going bad at once (smartctl says they PASS), which leaves the mobo as the culprit in my eyes.

score 0 · Answer 1 · answered Mar 05 '11 at 05:38

This is strange. As a long shot, I'd first try reseating the cables, then replacing them if reseating doesn't help.

I've seen HDDs start to get sick with bad sectors and really laggy performance. But the odds against two drives going bad at the same time point to the controller or motherboard, as you mentioned.

If at all possible, I'd try removing one drive at a time and running your tests again to see whether the performance issues are still present with either drive alone, or only occur with both installed.
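
For the per-drive runs, a bounded write may be easier to compare than the open-ended dd from the question, which keeps writing until the disk fills. The sketch below is only an illustration: the function name write_test, the default block count, and the example paths are my assumptions, and conv=fsync assumes GNU dd.

```shell
#!/bin/sh
# Bounded write test: writes "count" blocks of 4096 bytes to the target file,
# asks dd to flush the data to disk before exiting (conv=fsync), and prints
# the last line of dd's own summary.
write_test() {
    target=$1
    count=${2:-262144}   # 262144 * 4096 bytes = 1 GiB by default
    dd if=/dev/zero of="$target" bs=4096 count="$count" conv=fsync 2>&1 | tail -n 1
}

# Example calls (hypothetical paths, one on each drive's filesystem):
#   write_test /bigtestfile
#   write_test /mnt/sdb/bigtestfile
```

Watching atop while each run is in progress, and deleting the test file afterwards, keeps the comparison between drives like for like.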

Good luck.

Thanks, will have to get the remote hands at the DC to try removing the extra drive and see if it makes any difference. — quidpro, Mar 05 '11 at 18:41

score 0 · Accepted Answer · answered Mar 05 '11 at 05:47

Some drift is inevitable. Clock discipline provided by things like NTP helps smooth it out. Linux has a selection of timers it can use, and some are vulnerable to load-related drift. Disk I/O causing drift is not surprising in a two-disk system, as it's possible that the storage controller and time controller are on the same southbridge chip.

The HPET timer is more precise, but it still requires correction to stay true to UTC. More precise timers need either software (NTP, for instance) or special hardware to make sure time doesn't drift.
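
To see which timer the kernel is actually using, something like the following may help. It is a sketch under an assumption: it relies on the clocksource entries under /sys/devices/system/clocksource/clocksource0, which newer kernels expose but which the stock 2.6.18 x86_64 kernel on CentOS 5 may not. The function name show_clocksource is mine.

```shell
#!/bin/sh
# Print the clocksource the kernel is currently using and the ones it
# reports as available. The directory argument defaults to the usual sysfs
# location; if the entries are missing, report that and return non-zero.
show_clocksource() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    if [ -r "$dir/current_clocksource" ]; then
        echo "current:   $(cat "$dir/current_clocksource")"
        echo "available: $(cat "$dir/available_clocksource")"
    else
        echo "clocksource sysfs entries not found under $dir" >&2
        return 1
    fi
}

# Example call:
#   show_clocksource
```

If hpet is listed as available but not current, that is worth knowing before blaming the hardware.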

As for the excessive DSK time, I have seen instances where IOWAIT climbs to insane levels. That's a result of the disk subsystem not being able to keep up with demand, and your dd command is designed to throw a lot of data at the disk in a short period of time. In a two-disk system this seems... unusual. I'm suspecting a bad data-path somewhere in the motherboard's firmware; hardware faults should leave screaming traces in dmesg.
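
A quick way to look for those traces is to filter the kernel log for disk, controller, and timer complaints. The sketch below is illustrative only: the function name filter_kernel_log and the pattern list are my assumptions, not a complete catalogue of what a failing controller prints, so an empty result does not prove the hardware is healthy.

```shell
#!/bin/sh
# Filter kernel log text on stdin for lines that often accompany disk,
# controller, or timekeeping trouble. The patterns are an illustrative
# guess, not an exhaustive list. grep exits non-zero when nothing matches.
filter_kernel_log() {
    grep -iE 'ata[0-9]+.*(error|exception|failed|reset)|I/O error|clocksource|lost ticks'
}

# Example call:
#   dmesg | filter_kernel_log
```

Running it once at idle and again right after a dd run shows whether new messages appear under load.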

Thanks for the info. I can see DSK going to 100% being normal... but going to over 100% (ie. 1700%) is that a normal possibility under a "perfect" Linux system? — quidpro, Mar 05 '11 at 16:43
Initial tests with NTPDATE show a jitter of 0.1s... over time this climbs and after one test the jitter was 0.6s. So definitely looks like a clocking issue with the mobo. — quidpro, Mar 06 '11 at 01:41
