I have a simple server setup:
2 NVMe SSDs (both SAMSUNG MZVLB1T0HALR-00000, 1 TB each) combined into a RAID0 array.
OS: Ubuntu 19.04.
Today my system stopped responding and a reboot didn't help. I connected via KVM and saw these error messages on the boot screen:
md/raid0:md0: too few disks (1 of 2) - aborting!
md: pers->run() failed ...
mdadm: failed to start array /dev/md/0: Invalid argument
md/raid1:md1: active with 1 out of 2 mirrors
md1: detected capacity change from 0 to 536281088
md/raid0:md2: too few disks (1 of 2) - aborting!
md: pers->run() failed ...
mdadm: failed to start array /dev/md/2: Invalid argument
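For reference, md0 and md2 are the RAID0 arrays spanning both disks and md1 is a small (~512 MB) RAID1, so with one NVMe missing the RAID0s cannot assemble at all while the RAID1 just runs degraded. From a rescue system the array state can be inspected with the usual mdadm tools, roughly like this (the partition names are only my assumption about the layout):
cat /proc/mdstat                      # which arrays assembled and with how many members
sudo mdadm --detail /dev/md1          # md1 assembles degraded, so this should still work
sudo mdadm --examine /dev/nvme0n1p3   # RAID metadata on the surviving RAID0 member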
Then I booted into the rescue system and tried to check the disks for errors, but I couldn't find the 2nd disk: there was only /dev/nvme0 and no /dev/nvme1.
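These are roughly the kinds of checks one can run from the rescue system to look for the missing device (just a sketch, not the exact commands I ran):
sudo nvme list                       # NVMe controllers/namespaces the kernel can see
lsblk -d -o NAME,MODEL,SIZE          # block-device view; the missing disk simply isn't listed
dmesg | grep -i nvme                 # kernel messages, in case the controller failed to initialize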
I wrote to technical support (my server is hosted at Hetzner) and asked them to check the disks for me. They shut the server down for a minute, powered it back on, and could then see the 2nd disk in the rescue system.
They checked both drives for errors, and the 1st one showed some SMART errors:
sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 33 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 21%
data_units_read : 279,672,974
data_units_written : 366,481,283
host_read_commands : 2,479,016,466
host_write_commands : 2,637,293,356
controller_busy_time : 19,928
power_cycles : 10
power_on_hours : 5,153
unsafe_shutdowns : 4
media_errors : 21
num_err_log_entries : 26
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 33 C
Temperature Sensor 2 : 39 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
They told me the disk looked like it had failed and needed to be replaced, and of course all the data on it would be lost.
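The numbers they pointed at were presumably media_errors : 21 and num_err_log_entries : 26. One simple thing I can do is re-read the SMART log from time to time and check whether those counters keep growing, e.g.:
sudo nvme smart-log /dev/nvme0 | grep -E 'critical_warning|media_errors|num_err_log_entries'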
I tried simply rebooting the system once more (since they had managed to bring the 2nd disk back), and the system booted normally!
Then I tried to read the error log with the nvme error-log command, but it only shows "SUCCESS" entries:
sudo nvme error-log /dev/nvme0
Error Log Entries for device:nvme0 entries:64
.................
Entry[ 0]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
cs : 0
.................
Entry[ 1]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
cs : 0
...and so on
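As a cross-check, the same logs can also be read with smartmontools, assuming a smartctl version with NVMe support (this is just a sketch):
sudo apt install smartmontools       # if not already installed
sudo smartctl -a /dev/nvme0          # health data plus an error-log summary
sudo smartctl -l error /dev/nvme0    # NVMe Error Information log, if the tool/drive support it
But I don't know whether it would show anything beyond what nvme error-log already does.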
The system seems to be working normally now. I don't know what that was, but for some reason one of the disks suddenly dropped out and wouldn't come back until a full power-off and reboot was done.
Now I'm wondering: is there a way to read the actual error log? And how can I test the disks to make sure the first one really needs to be replaced?
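What I was thinking of trying is the NVMe device self-test, assuming the drive supports that optional feature and my nvme-cli version has the command (a sketch, not tested):
sudo nvme device-self-test /dev/nvme0 -n 1 -s 2   # start an extended self-test on namespace 1
sudo nvme self-test-log /dev/nvme0                # check progress and the result afterwards
Would something like that be conclusive, or is the media_errors count alone already enough reason to replace the drive?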