We have a standalone ESXi 5 server with the following hardware specs:

- Supermicro X8DTL
- Intel Xeon E5506 CPU @ 2.13 GHz
- 25 GB RAM
- 1 TB HDD (mirrored RAID, local SATA)
We have around 17 VMs running, each with ~512 MB of RAM, hosting web and database servers.
Around a month ago the server crashed; on investigation we found errors similar to these in /scratch/log/vobd.log:
2013-02-21T23:30:14.054Z: [scsiCorrelator] 1657239493834us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 1310595 microseconds to 260642 microseconds.
2013-02-21T23:30:17.888Z: [scsiCorrelator] 1657243328201us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 260642 microseconds to 85292 microseconds.
2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657264714482us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657263440772us: [esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
2013-02-21T23:30:42.796Z: [scsiCorrelator] 1657268235408us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 1310310 microseconds to 257850 microseconds.
2013-02-21T23:30:44.392Z: [scsiCorrelator] 1657269831493us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 257850 microseconds to 86289 microseconds.
2013-02-21T23:32:29.119Z: [scsiCorrelator] 1657374559512us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43613 microseconds to 1405607 microseconds.
2013-02-21T23:32:29.120Z: [scsiCorrelator] 1657373285533us: [esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43613 microseconds to 1405607 microseconds.
2013-02-21T23:32:35.673Z: [scsiCorrelator] 1657381113191us: [vob.scsi.device.io.latency.improved] Device mp
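Counting these events per day is straightforward from the ESXi shell. A minimal sketch, assuming vobd.log is still at /scratch/log/vobd.log (rotated copies sit alongside it) and noting that each event appears twice in the excerpt (a vob.* line plus an esx.problem.* line), so only the latter is matched:

# Count the latency-high events per day; adjust the path/glob for rotated logs.
grep 'esx.problem.scsi.device.io.latency.high' /scratch/log/vobd.log \
  | cut -c1-10 \
  | sort | uniq -c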
On the day of the crash we logged almost 5,000 of these errors; since then we have seen anywhere from 2 to 500 per day (though no further full server crashes). On the guest VMs we are experiencing slow disk reads and writes during normal use, and something as simple as a find command on / causes large spikes in the performance chart.
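One way to put host-side numbers on those spikes is esxtop while reproducing the load; a rough sketch, with an arbitrary interval and sample count:

# Capture 60 samples at 5-second intervals while running find / inside a guest.
# Interactively, run esxtop and press 'u' for the disk-device view and watch
# DAVG/cmd and GAVG/cmd for mpx.vmhba2:C0:T0:L0.
esxtop -b -d 5 -n 60 > /tmp/esxtop-find-test.csv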
We have since replaced both hard drives and the RAID controller. A server with an identical setup and a similar number of VMs does not have these issues. Before the first crash (the one with ~5,000 errors) performance was fine, although the logs already showed the same error roughly 30-40 times a day. A few days before that crash we did thin-provision a large (160 GB) virtual disk for a guest VM.
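Regarding the thin-provisioned disk, its actual allocation on the VMFS volume can at least be checked from the shell; the datastore and VM names below are placeholders:

# On VMFS, ls -lh shows the provisioned size of a thin disk's -flat.vmdk,
# while du -h shows how much of it is actually allocated.
ls -lh /vmfs/volumes/datastore1/guestvm/guestvm-flat.vmdk
du -h  /vmfs/volumes/datastore1/guestvm/guestvm-flat.vmdk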
The following table shows, for each day: the date, the number of times the error message was logged, the average "before" latency reported in those messages (microseconds), and the average "after" latency (microseconds). A sketch of how such a summary can be pulled out of vobd.log follows the table.

Date        Count   Avg before (µs)   Avg after (µs)
2012-10-24 16 976 138,666
2012-10-28 12 1,020 40,421
2012-11-05 16 1,167 273,223
2012-11-06 20 1,226 89,181
2012-11-07 40 1,314 224,957
2012-11-08 48 1,378 165,349
2012-11-09 42 1,441 174,061
2012-11-10 26 1,519 218,381
2012-11-11 8 1,567 112,229
2012-11-12 24 1,593 233,350
2012-11-13 54 1,641 193,695
2012-11-14 80 1,692 222,456
2012-11-15 32 1,738 243,640
2012-11-16 66 1,776 325,366
2012-11-17 30 1,816 176,468
2012-11-18 38 1,850 264,176
2012-11-20 12 1,846 117,589
2012-11-21 34 1,868 252,732
2012-11-22 44 1,895 166,636
2012-11-23 12 1,926 123,632
2012-11-26 4 1,892 98,791
2012-11-27 14 1,899 184,382
2012-11-28 20 1,916 178,908
2012-11-29 10 1,923 134,338
2012-11-30 6 1,923 69,203
2012-12-01 2 1,924 60,052
2012-12-02 4 1,919 122,631
2012-12-03 8 1,898 126,051
2012-12-04 54 1,909 199,758
2012-12-05 462 2,109 394,950
2012-12-06 36 2,228 191,166
2012-12-07 64 2,245 204,348
2012-12-08 32 2,271 294,890
2012-12-10 140 2,290 302,435
2012-12-11 314 2,386 311,973
2012-12-12 150 2,475 261,258
2012-12-13 160 2,532 236,761
2012-12-14 114 2,585 206,043
2012-12-15 84 2,618 211,221
2012-12-16 52 2,640 256,677
2012-12-17 18 2,637 180,975
2012-12-18 62 2,649 228,785
2012-12-19 92 2,669 199,357
2012-12-20 160 2,707 275,119
2012-12-21 124 2,749 245,460
2012-12-22 2 2,763 102,838
2012-12-26 144 2,736 302,383
2012-12-27 140 2,776 292,725
2012-12-28 64 2,813 274,609
2012-12-30 106 2,811 231,112
2012-12-31 148 2,853 295,416
2013-01-01 12 2,881 204,615
2013-01-04 4 2,860 90,300
2013-01-09 246 2,849 279,765
2013-01-10 278 2,909 301,014
2013-01-11 242 2,966 294,417
2013-01-12 92 3,006 308,232
2013-01-14 248 3,036 271,435
2013-01-15 426 3,172 233,094
2013-01-16 388 3,313 276,185
2013-01-17 342 3,423 282,632
2013-01-18 298 3,517 255,919
2013-01-19 232 3,579 287,905
2013-01-20 8 3,611 128,877
2013-01-21 2 3,614 121,942
2013-01-22 142 3,667 265,338
2013-01-23 402 3,738 281,091
2013-01-24 332 3,826 280,295
2013-01-25 178 3,892 270,747
2013-01-26 280 4,018 319,368
2013-01-27 106 4,075 293,760
2013-01-28 610 4,187 213,410
2013-01-29 784 4,700 222,077
2013-01-30 386 5,236 258,133
2013-01-31 4580 8,261 1,681,902
2013-02-01 2 11,211 339,135
2013-02-02 10 38,909 1,200,144
2013-02-04 18 88,573 2,692,687
2013-02-05 190 67,454 2,094,093
2013-02-06 460 58,534 1,858,435
2013-02-07 98 57,683 1,795,912
2013-02-08 62 54,012 1,671,730
2013-02-09 88 52,681 1,711,773
2013-02-10 66 51,016 1,549,408
2013-02-11 84 48,885 1,639,267
2013-02-12 206 48,364 1,829,969
2013-02-13 562 48,651 1,774,433
2013-02-14 170 48,957 1,655,395
2013-02-15 124 47,055 1,550,294
2013-02-16 140 46,099 1,588,326
2013-02-17 110 45,283 1,485,211
2013-02-18 34 43,836 1,356,562
2013-02-19 326 43,608 1,484,757
2013-02-20 224 43,894 1,581,129
2013-02-21 296 43,626 1,568,687
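For reference, a per-day summary like the table above can be regenerated from vobd.log with something along these lines; it is only a sketch, assuming the exact message wording shown in the excerpt and counting just the esx.problem.* lines:

# Per-day count and average "before"/"after" latency (microseconds), parsed from
# lines like "... increased from average value of X microseconds to Y microseconds."
grep 'esx.problem.scsi.device.io.latency.high' /scratch/log/vobd.log |
awk '{
    day = substr($1, 1, 10)                    # date part of the timestamp
    for (i = 1; i <= NF; i++) {
        if ($i == "value") before = $(i + 2)   # "... average value of X microseconds"
        if ($i == "to")    after  = $(i + 1)   # "... to Y microseconds."
    }
    n[day]++; b[day] += before; a[day] += after
}
END {
    for (d in n) printf "%s %d %.0f %.0f\n", d, n[d], b[d] / n[d], a[d] / n[d]
}' | sort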
At this point we are pretty much at a loss. The best explanation we have is that, since we are using SATA drives (which is probably a terrible idea), we are hitting a big disk bottleneck. We are planning to move to a SAN with SAS drives, but we want to make sure the problem doesn't follow us.
Thanks