
We have a standalone ESXi 5 server with the following hardware specs:

  • Supermicro X8DTL
  • Intel Xeon(R) CPU E5506 2.13GHz
  • 25G RAM
  • 1TB HD (mirrored RAID, local SATA)

We have around 17 VMs running, each with ~512MB, hosting web and database servers.

Around a month ago the server crashed. On investigation we found errors similar to these in /scratch/log/vobd.log:

2013-02-21T23:30:14.054Z: [scsiCorrelator] 1657239493834us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 1310595 microseconds to 260642 microseconds.
2013-02-21T23:30:17.888Z: [scsiCorrelator] 1657243328201us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 260642 microseconds to 85292 microseconds.
2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657264714482us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657263440772us: [esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
2013-02-21T23:30:42.796Z: [scsiCorrelator] 1657268235408us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 1310310 microseconds to 257850 microseconds.
2013-02-21T23:30:44.392Z: [scsiCorrelator] 1657269831493us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 257850 microseconds to 86289 microseconds.
2013-02-21T23:32:29.119Z: [scsiCorrelator] 1657374559512us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43613 microseconds to 1405607 microseconds.
2013-02-21T23:32:29.120Z: [scsiCorrelator] 1657373285533us: [esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43613 microseconds to 1405607 microseconds.
2013-02-21T23:32:35.673Z: [scsiCorrelator] 1657381113191us: [vob.scsi.device.io.latency.improved] Device mp

On the day of the crash we had almost 5000 of these errors. Since then we have had as few as 2 per day and as many as 500 (though no full server crashes). On the guest VMs we are experiencing slow disk reads and writes during normal use. Something as simple as a find command on / causes large spikes in the performance chart.

We have replaced both HDs and the RAID controller. A server with an identical setup and a similar number of VMs does not have these issues. Before the first crash (the one with 5k errors) performance was fine, although the logs show the same error ~30-40 times a day even then. A few days before this crash we did thin provision a large (160GB) HD for a guest VM.

The following table lists, for each date, the number of times that error message appears, the average of the latencies logged before the error, and the average after. Latencies are in microseconds, as in the log.

Date       Count   Before (µs)  After (µs)
2012-10-24    16           976     138,666
2012-10-28    12         1,020      40,421
2012-11-05    16         1,167     273,223
2012-11-06    20         1,226      89,181
2012-11-07    40         1,314     224,957
2012-11-08    48         1,378     165,349
2012-11-09    42         1,441     174,061
2012-11-10    26         1,519     218,381
2012-11-11     8         1,567     112,229
2012-11-12    24         1,593     233,350
2012-11-13    54         1,641     193,695
2012-11-14    80         1,692     222,456
2012-11-15    32         1,738     243,640
2012-11-16    66         1,776     325,366
2012-11-17    30         1,816     176,468
2012-11-18    38         1,850     264,176
2012-11-20    12         1,846     117,589
2012-11-21    34         1,868     252,732
2012-11-22    44         1,895     166,636
2012-11-23    12         1,926     123,632
2012-11-26     4         1,892      98,791
2012-11-27    14         1,899     184,382
2012-11-28    20         1,916     178,908
2012-11-29    10         1,923     134,338
2012-11-30     6         1,923      69,203
2012-12-01     2         1,924      60,052
2012-12-02     4         1,919     122,631
2012-12-03     8         1,898     126,051
2012-12-04    54         1,909     199,758
2012-12-05   462         2,109     394,950
2012-12-06    36         2,228     191,166
2012-12-07    64         2,245     204,348
2012-12-08    32         2,271     294,890
2012-12-10   140         2,290     302,435
2012-12-11   314         2,386     311,973
2012-12-12   150         2,475     261,258
2012-12-13   160         2,532     236,761
2012-12-14   114         2,585     206,043
2012-12-15    84         2,618     211,221
2012-12-16    52         2,640     256,677
2012-12-17    18         2,637     180,975
2012-12-18    62         2,649     228,785
2012-12-19    92         2,669     199,357
2012-12-20   160         2,707     275,119
2012-12-21   124         2,749     245,460
2012-12-22     2         2,763     102,838
2012-12-26   144         2,736     302,383
2012-12-27   140         2,776     292,725
2012-12-28    64         2,813     274,609
2012-12-30   106         2,811     231,112
2012-12-31   148         2,853     295,416
2013-01-01    12         2,881     204,615
2013-01-04     4         2,860      90,300
2013-01-09   246         2,849     279,765
2013-01-10   278         2,909     301,014
2013-01-11   242         2,966     294,417
2013-01-12    92         3,006     308,232
2013-01-14   248         3,036     271,435
2013-01-15   426         3,172     233,094
2013-01-16   388         3,313     276,185
2013-01-17   342         3,423     282,632
2013-01-18   298         3,517     255,919
2013-01-19   232         3,579     287,905
2013-01-20     8         3,611     128,877
2013-01-21     2         3,614     121,942
2013-01-22   142         3,667     265,338
2013-01-23   402         3,738     281,091
2013-01-24   332         3,826     280,295
2013-01-25   178         3,892     270,747
2013-01-26   280         4,018     319,368
2013-01-27   106         4,075     293,760
2013-01-28   610         4,187     213,410
2013-01-29   784         4,700     222,077
2013-01-30   386         5,236     258,133
2013-01-31  4580         8,261   1,681,902
2013-02-01     2        11,211     339,135
2013-02-02    10        38,909   1,200,144
2013-02-04    18        88,573   2,692,687
2013-02-05   190        67,454   2,094,093
2013-02-06   460        58,534   1,858,435
2013-02-07    98        57,683   1,795,912
2013-02-08    62        54,012   1,671,730
2013-02-09    88        52,681   1,711,773
2013-02-10    66        51,016   1,549,408
2013-02-11    84        48,885   1,639,267
2013-02-12   206        48,364   1,829,969
2013-02-13   562        48,651   1,774,433
2013-02-14   170        48,957   1,655,395
2013-02-15   124        47,055   1,550,294
2013-02-16   140        46,099   1,588,326
2013-02-17   110        45,283   1,485,211
2013-02-18    34        43,836   1,356,562
2013-02-19   326        43,608   1,484,757
2013-02-20   224        43,894   1,581,129
2013-02-21   296        43,626   1,568,687
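
For anyone who wants to build the same kind of per-day summary from their own vobd.log, here is a minimal Python sketch. It is not the script used for the table above; the function name `summarize` and the regular expression are illustrative, written against the log lines quoted earlier. The counts in the table are all even, which would be consistent with counting both the `vob.` and `esx.problem.` copies of each event, so the sketch counts both. That is an assumption about how the table was produced.

```python
import re
from collections import defaultdict

# Matches the "latency.high" events in the format shown in the log excerpt.
# Both the vob.* and esx.problem.* copies are counted (an assumption).
HIGH_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})T\S+ \[scsiCorrelator\] \d+us: "
    r"\[(?:vob|esx\.problem)\.scsi\.device\.io\.latency\.high\] .*?"
    r"from average value of (?P<before>\d+) microseconds "
    r"to (?P<after>\d+) microseconds"
)

def summarize(lines):
    """Return {date: (count, mean_before_us, mean_after_us)}."""
    per_day = defaultdict(list)
    for line in lines:
        m = HIGH_RE.search(line)
        if m:
            per_day[m["date"]].append((int(m["before"]), int(m["after"])))
    summary = {}
    for date, pairs in sorted(per_day.items()):
        n = len(pairs)
        mean_before = sum(b for b, _ in pairs) / n
        mean_after = sum(a for _, a in pairs) / n
        summary[date] = (n, mean_before, mean_after)
    return summary
```

Passing an open file handle for a copy of /scratch/log/vobd.log to `summarize` yields one entry per day. Lines that report improved latency, and truncated lines, are simply skipped.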

At this point we are pretty much at a loss. Our best guess is that since we are using SATA drives (which is probably a terrible idea) we are hitting a big bottleneck. We are planning on moving to a SAN with SAS drives, but we want to make sure the problem doesn't follow us.

Thanks

Pratik Amin
  • Supermicro... SATA... bummer. – ewwhite Feb 22 '13 at 00:19
  • @ewwhite seconded, like using a motorbike to transport tanks – Chopper3 Feb 22 '13 at 00:26
  • I think this question might be related to the problems you are having: http://serverfault.com/questions/231496/vmware-esxi-looking-for-bottlenecks?rq=1. They are running multiple VMs on a single server, with two *SAS* drives, and seeing problems. – funkaoshi Feb 22 '13 at 15:37

2 Answers


Honestly, you may have solved your own problem!

  • You've identified the effects of the issue... and a possible source.
  • You've verified that it can work on a similar setup.
  • You've observed bad behavior on a single machine.
  • You did NOT replace the chassis or backplane. Your issues probably lie there.
  • You bought Supermicro, which does not have the same level of polish or quality-control consistency as IBM, HP or Dell's offerings.

This happens. Replace the server and move on.

ewwhite

Not really a question, but...

It's possible that your RAID controller switched to write-through mode. One cause could be a faulty BBU (or its learn cycle). This can greatly reduce performance.

Roman