Bad Sector on disk causes the whole server to crash

Question

Our dedicated server's hard disk was recently diagnosed with bad sectors. Whenever certain data on the disk is about to be accessed, the whole server stops responding until I issue a restart from the robot panel. We asked our server provider to install a new disk drive, and they did. Now I want to copy everything onto the new disk (the old disk is still attached), so I start the server in rescue mode (network boot) and run the following on the network-booted server via SSH (as root):

ddrescue -d -f -r3 /dev/sdb /dev/sda /home/ddrescue.log

After about 5 minutes the server stops responding, not even to SSH (as if the port were closed).

What could cause this? How can I prevent the server from going bananas when faulty sectors are about to be accessed?

What file system do you use? Otherwise, maybe read `man fsck` and try `fsck`? – ETL, Aug 11 '18 at 22:58
The symptoms you describe would not be caused by a bad sector. If the bad sector were on the disk the system was running from, it could plausibly crash sshd. But a bad sector on a disk you aren't actually running the system from would at most cause disk I/O to stall while the kernel was trying to read it; it wouldn't cause sshd to crash. And with sshd running from a rescue image in RAM, it wouldn't even cause sshd to stall. If you accessed the disk using the `ext4` code in the kernel, that can be configured to panic on errors. But `ddrescue` wouldn't do that. – kasperd, Aug 12 '18 at 18:24
So why does the whole server freeze each time dd gets past (in my case) 8.2 GB? – chakmeshma, Aug 12 '18 at 18:27

score 1 · Answer 1 · answered Aug 12 '18 at 12:33

You should try enabling TLER (time-limited error recovery).

Without it, a disk with bad sectors will keep trying to read the affected sectors for 30+ seconds, possibly crashing the entire disk subsystem.
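
A minimal sketch of checking and lowering that timeout, assuming `smartctl` (smartmontools) is installed and the failing disk is `/dev/sdb`. The helper name `set_erc` and the `DRY_RUN` switch are made up for illustration. `smartctl -l scterc,READ,WRITE` takes its values in tenths of a second, so 2 seconds is written as `20`:

```shell
#!/bin/sh
# set_erc DEVICE SECONDS  (hypothetical helper, not part of smartmontools)
# Sets the SCT Error Recovery Control read and write timeouts on DEVICE.
# With DRY_RUN=1 it only prints the command it would run.
set_erc() {
    dev=$1
    ds=$(( $2 * 10 ))    # smartctl expects tenths of a second
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "smartctl -l scterc,${ds},${ds} ${dev}"
    else
        smartctl -l scterc,"${ds}","${ds}" "${dev}"
    fi
}

# Typical use as root (device name and value are assumptions):
#   smartctl -l scterc /dev/sdb    # show the current read/write timeouts
#   set_erc /dev/sdb 2             # lower both to 2.0 seconds
```

On many drives this setting does not survive a power cycle, so it may have to be reapplied after each hardware reset, before the copy is started again.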

shodanshok


Apparently it's already enabled on the disk, `smartctl -l scterc /dev/sdb` returns 70,70 (7 seconds for read and write) – chakmeshma Aug 12 '18 at 13:35
Try to lower it (i.e. to 2 or 3 seconds) – shodanshok Aug 12 '18 at 15:00

score 0 · Answer 2 · answered Aug 12 '18 at 11:41

From the looks of it your hardware/driver/whatever freezes when it encounters a bad block and you cannot proceed with the backup.

Do you have a list of bad blocks?

How about doing a logical backup (with tar, for example)?

The way I would approach this, but haven't tested:

1. Somehow get a list of bad blocks (fsck.ext4 with -c and/or -l)?
2. Having the list of bad blocks, find the files that are affected using debugfs:

    icheck block ...
          Print a listing of the inodes which use the one or  more  blocks
          specified on the command line.

3. Create a logical backup with tar --exclude...
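
The steps above could be sketched roughly like this. Like the approach itself, it is untested against a failing disk. The helper names `icheck_inodes` and `ncheck_paths` are hypothetical, and `/dev/sdb1`, `/mnt/old`, `/mnt/new` and the block numbers are placeholders. It relies on `debugfs -R` with the `icheck` and `ncheck` requests, and on GNU tar's `-X` (exclude-from) option:

```shell
#!/bin/sh
# icheck_inodes: read `debugfs icheck` output on stdin and print the unique
# inode numbers, skipping the header and "<block not found>" rows.
icheck_inodes() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ { print $2 }' | sort -un
}

# ncheck_paths: read `debugfs ncheck` output on stdin and print each pathname
# with a leading "." so it matches members archived via `tar -C MOUNTPOINT .`.
ncheck_paths() {
    awk 'NR > 1 { sub(/^[0-9]+[ \t]+/, ""); print "." $0 }'
}

# Possible use as root (all names and numbers below are placeholders):
#   inodes=$(debugfs -R "icheck 12345 12346" /dev/sdb1 | icheck_inodes | tr '\n' ' ')
#   debugfs -R "ncheck $inodes" /dev/sdb1 | ncheck_paths > /tmp/exclude.txt
#   tar -C /mnt/old -X /tmp/exclude.txt -cpf /mnt/new/backup.tar .
```

tar treats the lines in the exclude file as patterns, so file names containing wildcard characters such as `*` or `[` would need extra care.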

Good luck. :-s

Karol Nowak


I think the freeze occurs every time those bad blocks are about to be accessed/read, so I think fsck.ext4 would also freeze when it reaches those blocks, right? – chakmeshma Aug 12 '18 at 11:49
Maybe. Do you get bad block information in `dmesg`? Perhaps you could use that on the next run to skip them. – Karol Nowak Aug 12 '18 at 11:51
The whole server freezes completely. I can't run any command when it happens unless I issue a hardware reset, and because it's a network boot, all data is volatile and gone after the reset – chakmeshma Aug 12 '18 at 11:56
"Freezes completely" as in you lose network connection? Does it still ping? That's strange, but... do you have access to the kernel log in the admin panel? – Karol Nowak Aug 12 '18 at 11:59
Now I ran fsck and got this: "ext2fs_open2: Bad magic number in super-block" – chakmeshma Aug 12 '18 at 12:00
Yes, I lose the connection after the freeze – chakmeshma Aug 12 '18 at 12:01
Any chance the filesystem gets mounted with `errors=panic`? Take a look at the `sb=n` mount option / `-b` `fsck` option. – Karol Nowak Aug 12 '18 at 12:03
The fstab looks normal, without any errors=panic – chakmeshma Aug 12 '18 at 12:27
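
A related check, offered as a sketch rather than something taken from the thread: ext4 also stores a default error behavior in the superblock, so the absence of `errors=panic` in fstab does not rule it out on its own. `tune2fs -l` prints that default as an `Errors behavior:` line. The helper name `errors_behavior` and the partition `/dev/sdb1` are assumptions:

```shell
#!/bin/sh
# errors_behavior: read `tune2fs -l` (or `dumpe2fs -h`) output on stdin and
# print the superblock's default error behavior in lower case.
errors_behavior() {
    awk -F': *' '/^Errors behavior:/ { print tolower($2) }'
}

# Possible use as root (partition name is an assumption):
#   tune2fs -l /dev/sdb1 | errors_behavior
```

A result of `panic` would mean the kernel panics on filesystem errors whenever that filesystem is mounted without an overriding `errors=` option.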
