
This is on a SuperMicro 1U box with four drives in RAID 10.

A week ago it threw its toys around with three critical errors:

Controller encountered a fatal error and was reset

VD is now DEGRADED VD 0

Consistency Check failed on VD (at 04:19:21)

It then proceeded to rebuild, and almost exactly an hour later it reported that the rebuild had completed and "Drive Cache settings restored after rebuild on PD".

Now, at the same time every morning (04:19:47), it reports:

Controller ID: 0 PD Predictive Failure: Port 0 - 3:0:3

I have seen many predictive failures in the past, but generally these are sporadic and increase over time until the drive is replaced or fails. Happening the same time every day does suggest (to me at least) that something else is happening.

Any ideas?

Thanks

================ EDIT ================

Ran a consistency check (25/09) and nothing popped up

================ EDIT 2 ================

Thanks to GapSF I have downloaded the StorCLI64 utility, and it does look like there are some bad blocks on one drive:

9/19/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/19/22  4:19:47: C0:EVT#565406-09/19/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
09/20/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/20/22  4:19:47: C0:EVT#565407-09/20/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
09/21/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/21/22  4:19:47: C0:EVT#565408-09/21/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
09/22/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/22/22  4:19:47: C0:EVT#565409-09/22/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
09/23/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/23/22  4:19:47: C0:EVT#565410-09/23/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
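
To check the "same time every day" pattern without eyeballing it, here is a minimal Python sketch that parses the excerpt above and groups the predictive failure events by time of day. The function names are my own, the line layout is inferred only from the ten lines shown here, and reading `e0xfc` as a hexadecimal enclosure ID is an assumption.

```python
import re
from collections import defaultdict
from datetime import datetime

# The ten lines from the termlog excerpt above, copied as-is.
LOG_EXCERPT = """\
9/19/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/19/22  4:19:47: C0:EVT#565406-09/19/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
09/20/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/20/22  4:19:47: C0:EVT#565407-09/20/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
09/21/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/21/22  4:19:47: C0:EVT#565408-09/21/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
09/22/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/22/22  4:19:47: C0:EVT#565409-09/22/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
09/23/22  4:19:20: C0:Bad Block Count for LD 0 is 0
09/23/22  4:19:47: C0:EVT#565410-09/23/22  4:19:47:  96=Predictive failure: PD 04(e0xfc/s3)
"""

# Assumed layout: "MM/DD/YY  H:MM:SS: C0:<message>" (inferred from the excerpt).
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2})\s+(\d{1,2}:\d{2}:\d{2}): C0:(.*)$")
# Assumed layout of the drive reference: "PD <id>(e<hex enclosure>/s<slot>)".
PD_RE = re.compile(r"Predictive failure: PD (\w+)\(e(0x[0-9a-fA-F]+)/s(\d+)\)")


def parse_termlog(text):
    """Return a list of (timestamp, message) tuples for lines that match."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not in the assumed layout
        date_part, time_part, message = match.groups()
        stamp = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%y %H:%M:%S")
        events.append((stamp, message))
    return events


def predictive_failures_by_time(events):
    """Group predictive failure events by time of day (HH:MM:SS)."""
    grouped = defaultdict(list)
    for stamp, message in events:
        match = PD_RE.search(message)
        if match is None:
            continue
        _device_id, enclosure_hex, slot = match.groups()
        grouped[stamp.strftime("%H:%M:%S")].append(
            (stamp.date().isoformat(), int(enclosure_hex, 16), int(slot))
        )
    return dict(grouped)


if __name__ == "__main__":
    for time_of_day, hits in predictive_failures_by_time(parse_termlog(LOG_EXCERPT)).items():
        print(time_of_day, hits)
```

A single time-of-day key with one entry per date would confirm that the alert fires on a fixed daily schedule rather than sporadically.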
gchq
  • Check the termlog with storcli: `storcli /c0 show termlog type=contents`. – gapsf Sep 25 '22 at 16:51
  • Also check media error count on physical discs: `storcli /c0/eall/sall show all` – gapsf Sep 25 '22 at 16:58
  • Thank you for replying. Currently using RWC2 and RWC3 (RWC2 does seem a lot better than RWC3) - Am I correct in assuming that StorCLI is software I need to download from LSI and run via a command prompt? – gchq Sep 25 '22 at 18:49
  • I don't know what RWC is. If your RAID controller is LSI/Avago (currently Broadcom), then yes, it's storcli from the Broadcom site – gapsf Sep 25 '22 at 18:55
  • I found out what RWC is: it is originally LSI MegaRAID Storage Manager – gapsf Sep 25 '22 at 19:02
  • Intel, IBM and Lenovo controllers are also LSI OEM – gapsf Sep 25 '22 at 19:04
  • If you use ESXi, you need to install it on the ESXi host, not on a guest VM – gapsf Sep 25 '22 at 19:05
  • RWC is the Intel tool (RAID Web Console 2 and 3). Controller is Avago MegaRAID SAS-4i – gchq Sep 25 '22 at 19:08
  • Avago bought LSI and Broadcom bought Avago, so it's LSI originally. https://www.broadcom.com/support/download-search?pg=Storage+Adapters,+Controllers,+and+ICs&pf=Storage+Adapters,+Controllers,+and+ICs&pn=&pa=&po=&dk=Storcli&pl=&l=false Under Management Software and Tools, take the latest storcli for all OSes – gapsf Sep 25 '22 at 19:12
  • Downloaded and ran StorCLI64 - there is a lot of data to wade through, but it does seem to point to some bad blocks on one drive. I assume that 'LD 0 is 0' is the drive on port 0 connected to controller 0? Not sure what PD 04 is. – gchq Sep 25 '22 at 19:51
  • e0xfc/s3 is the disk in slot 3. When the number of bad blocks goes over the threshold, the controller will mark the drive as failed, and then you need to replace it. For now it's still OK. Check the termlog for media sense errors, double errors and punctured blocks – gapsf Sep 25 '22 at 19:57
  • Something like https://support.lenovo.com/us/en/solutions/ht504153-recovering-serveraid-unrecoverable-medium-errors-lenovo-system-3250-m4 – gapsf Sep 25 '22 at 20:00
  • Ideally the termlog should be free of errors. Run a consistency check and view the termlog. The termlog has a fixed size: older records are overwritten by new ones. Note that the date and time in the termlog are in UTC – gapsf Sep 25 '22 at 20:03
  • Thank you so much for your help :-) Total for that drive is Media Error Count 367, Other Error Count 6, Predictive Failure Count 6. It's odd that RWC shows this as slot 0 and this as slot 3 – gchq Sep 25 '22 at 20:04
  • A media error count of 367 is not much. Keep an eye on the trend. You can compare the serial numbers reported by storcli with the numbers on the disks to be sure it is slot 3. You can also turn the LED blinking on and off for a disk or slot to identify it. Compare the physical disk information with the termlog and you will see that sX is the slot number – gapsf Sep 25 '22 at 20:18
  • `storcli /c0/e0/s3 start locate` should blink slot #3 – gapsf Sep 25 '22 at 20:21
  • https://docs.broadcom.com/doc/12352476 – gapsf Sep 25 '22 at 20:24
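
The "keep an eye on the trend" advice from the comments could be scripted along these lines. This is only a sketch: it assumes the per-drive counters appear in the `storcli /c0/eall/sall show all` output as `Name = value` lines, which should be checked against the real output. The sample text is hypothetical and built solely from the three totals quoted in the thread, and the helper names are made up for illustration.

```python
import re

# Hypothetical sample in an ASSUMED "Name = value" layout; the numbers are the
# totals quoted in the comments (367 / 6 / 6), not captured storcli output.
SAMPLE_OUTPUT = """\
Media Error Count = 367
Other Error Count = 6
Predictive Failure Count = 6
"""

COUNTER_RE = re.compile(
    r"^\s*(Media Error Count|Other Error Count|Predictive Failure Count)\s*=\s*(\d+)\s*$",
    re.MULTILINE,
)


def parse_counters(text):
    """Extract the three error counters from saved output as a dict of ints."""
    return {name: int(value) for name, value in COUNTER_RE.findall(text)}


def counter_growth(previous, current):
    """Return how much each counter grew between two saved snapshots."""
    return {name: current[name] - previous.get(name, 0) for name in current}


if __name__ == "__main__":
    today = parse_counters(SAMPLE_OUTPUT)
    # Hypothetical earlier snapshot, only to show the comparison.
    yesterday = {"Media Error Count": 360, "Other Error Count": 6, "Predictive Failure Count": 5}
    print(counter_growth(yesterday, today))
```

Saving the command output once a day and diffing the counters like this makes it easier to see whether the media error count is stable or climbing.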
