I have a simple server setup:
2 NVMe SSDs (both SAMSUNG MZVLB1T0HALR-00000, 1 TB each) combined into a RAID0 array.
OS: Ubuntu 19.04.
Today my system stopped responding and a reboot didn't help. I connected via KVM and saw these error messages on the boot screen:
md/raid0:md0: too few disks (1 of 2) - aborting!
md: pers->run() failed ...
mdadm: failed to start array /dev/md/0: Invalid argument
md/raid1:md1: active with 1 out of 2 mirrors
md1: detected capacity change from 0 to 536281088
md/raid0:md2: too few disks (1 of 2) - aborting!
md: pers->run() failed ...
mdadm: failed to start array /dev/md/2: Invalid argument
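For reference, md0 and md2 are the RAID0 arrays spanning both disks and md1 is a small (~512 MB) RAID1, so with one NVMe missing the RAID0s cannot assemble at all while the RAID1 just runs degraded. From a rescue system the array state can be inspected with the usual mdadm tools, roughly like this (the partition names are only my assumption about the layout):
cat /proc/mdstat                      # which arrays assembled and with how many members
sudo mdadm --detail /dev/md1          # md1 assembles degraded, so this should still work
sudo mdadm --examine /dev/nvme0n1p3   # RAID metadata on the surviving RAID0 member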
Then I booted into the rescue system and tried to check the disks for errors, but I couldn't find the 2nd disk: there was only /dev/nvme0 and no /dev/nvme1.
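These are roughly the kinds of checks one can run from the rescue system to look for the missing device (just a sketch, not the exact commands I ran):
sudo nvme list                       # NVMe controllers/namespaces the kernel can see
lsblk -d -o NAME,MODEL,SIZE          # block-device view; the missing disk simply isn't listed
dmesg | grep -i nvme                 # kernel messages, in case the controller failed to initialize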
I wrote to technical support (my server is hosted at Hetzner) and asked them to check the disks for me. They shut the server down for a minute, powered it back on, and could then see the 2nd disk in the rescue system.
They checked both drives for errors, and the 1st one showed some SMART errors:
sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 33 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 21%
data_units_read : 279,672,974
data_units_written : 366,481,283
host_read_commands : 2,479,016,466
host_write_commands : 2,637,293,356
controller_busy_time : 19,928
power_cycles : 10
power_on_hours : 5,153
unsafe_shutdowns : 4
media_errors : 21
num_err_log_entries : 26
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 33 C
Temperature Sensor 2 : 39 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
They told me the disk looked like it had failed and needed to be replaced, and of course all the data on it would be lost.
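The numbers they pointed at were presumably media_errors : 21 and num_err_log_entries : 26. One simple thing I can do is re-read the SMART log from time to time and check whether those counters keep growing, e.g.:
sudo nvme smart-log /dev/nvme0 | grep -E 'critical_warning|media_errors|num_err_log_entries'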
I tried simply rebooting the system once more (since they had managed to bring the 2nd disk back), and the system booted normally!
Then I tried to read the error log with the nvme error-log command, but it only shows "SUCCESS" entries:
sudo nvme error-log /dev/nvme0
Error Log Entries for device:nvme0 entries:64
.................
Entry[ 0]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
cs : 0
.................
Entry[ 1]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
cs : 0
...and so on
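As a cross-check, the same logs can also be read with smartmontools, assuming a smartctl version with NVMe support (this is just a sketch):
sudo apt install smartmontools       # if not already installed
sudo smartctl -a /dev/nvme0          # health data plus an error-log summary
sudo smartctl -l error /dev/nvme0    # NVMe Error Information log, if the tool/drive support it
But I don't know whether it would show anything beyond what nvme error-log already does.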
The system seems to be working normally now. I don't know what that was, but for some reason one of the disks suddenly dropped out and wouldn't come back until a full power-off and reboot was done.
Now I'm wondering: is there a way to read the actual error log? And how can I test the disks to make sure the first one really needs to be replaced?
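What I was thinking of trying is the NVMe device self-test, assuming the drive supports that optional feature and my nvme-cli version has the command (a sketch, not tested):
sudo nvme device-self-test /dev/nvme0 -n 1 -s 2   # start an extended self-test on namespace 1
sudo nvme self-test-log /dev/nvme0                # check progress and the result afterwards
Would something like that be conclusive, or is the media_errors count alone already enough reason to replace the drive?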