
Our dedicated server's hard disk was recently diagnosed with bad sectors, and whenever certain data on the disk is about to be accessed, the whole server stops responding until I issue a restart from the robot panel. We asked our server provider to install a new disk drive, and they did. Now I want to copy everything onto the new disk (the old disk is still attached), so I start the server in rescue mode (network boot) and run the following on the network-booted server via SSH (as root):

ddrescue -d -f -r3 /dev/sdb /dev/sda /home/ddrescue.log

After about 5 minutes the server stops responding, not even to SSH (as if the port were closed).

What could cause this? How can I prevent the server from going bananas when faulty sectors are about to be accessed?

chakmeshma
  • What file system do you use? Otherwise, maybe `man fsck` and use `fsck`? – ETL Aug 11 '18 at 22:58
  • The filesystem is EXT4 – chakmeshma Aug 11 '18 at 23:02
  • The symptoms you describe would not be caused by a bad sector. If the bad sector were on the disk the system was running from, it could plausibly crash sshd. But a bad sector on a disk you aren't actually running the system from would at most cause disk I/O to stall while the disk was trying to read the bad sector; it wouldn't cause sshd to crash. And with sshd running from a rescue image in RAM, it wouldn't even cause sshd to stall. If you accessed the disk using the `ext4` code in the kernel, it can be configured to cause a panic on errors. But `ddrescue` wouldn't do that. – kasperd Aug 12 '18 at 18:24
  • So why does the whole server freeze each time dd gets past (in my case) 8.2 GB? – chakmeshma Aug 12 '18 at 18:27

2 Answers


You should try enabling TLER (time-limited error recovery).

Without it, a disk with bad sectors will keep trying to read the affected sectors for 30+ seconds, possibly crashing the entire disk subsystem.
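
As a rough sketch of how this might be tried from the rescue system: smartmontools exposes the setting as SCT ERC through `smartctl -l scterc`. This assumes `smartctl` is installed there and that the drive supports SCT ERC at all (many desktop drives do not, and the setting is usually lost on a power cycle). The device name and the 7-second limit are illustrative, and the helper `erc_cmds` is made up for this example; it only prints the commands so you can review them before running anything against a failing disk.

```shell
#!/bin/sh
# Sketch: cap the drive's internal error recovery time via SCT ERC (TLER).
# DISK is a placeholder; /dev/sdb is the failing disk in the question.
DISK="${DISK:-/dev/sdb}"

erc_cmds() {
    # $1 = disk device. Prints the commands instead of running them.
    # Times are in tenths of a second, so 70 means 7.0 seconds.
    echo "smartctl -l scterc $1"        # show the current setting
    echo "smartctl -l scterc,70,70 $1"  # limit read and write recovery to 7 s
}

erc_cmds "$DISK"
```

If the first command reports that SCT Error Recovery Control is not supported, this approach is not available for that drive.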

shodanshok

From the looks of it, your hardware/driver/whatever freezes when it encounters a bad block, so you cannot proceed with the backup.

Do you have a list of bad blocks?

How about doing a logical backup (with tar, for example)?

This is how I would approach it, though I haven't tested it:

  • somehow get a list of bad blocks (fsck.ext4 with -c and/or -l)?
  • having the list of bad blocks, find files that are affected using debugfs:
    icheck block ...
          Print a listing of the inodes which use the one or more blocks
          specified on the command line.
  • create a logical backup with tar --exclude...
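
The steps above could be sketched roughly like this, with the same caveat that it is untested against a failing disk. The partition, mount point, archive path, block numbers, inode number and file path are all hypothetical placeholders, and the helper functions are made up for this example. `debugfs` opens the filesystem read-only by default, `icheck` maps filesystem block numbers (not disk sectors) to inodes, and `ncheck` maps inodes to path names. The script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch of the steps above. All values are hypothetical placeholders.
PART="${PART:-/dev/sdb1}"             # ext4 partition on the failing disk
SRC="${SRC:-/mnt/old}"                # where that partition is mounted read-only
DEST="${DEST:-/mnt/new/backup.tar}"   # archive on the new disk

icheck_cmd() {
    # $1 = partition, rest = filesystem block numbers (not disk sectors)
    part="$1"; shift
    echo "debugfs -R \"icheck $*\" $part"
}

ncheck_cmd() {
    # $1 = partition, rest = inode numbers reported by icheck
    part="$1"; shift
    echo "debugfs -R \"ncheck $*\" $part"
}

tar_cmd() {
    # $1 = source dir, $2 = archive, rest = paths (relative to $1) to skip.
    # Paths containing spaces are not handled in this sketch.
    src="$1"; dest="$2"; shift 2
    excludes=""
    for p in "$@"; do
        excludes="$excludes --exclude=./$p"
    done
    echo "tar -cf $dest -C $src$excludes ."
}

icheck_cmd "$PART" 1234567 1234568
ncheck_cmd "$PART" 4242
tar_cmd "$SRC" "$DEST" var/lib/example/damaged.db
```

`ncheck` reports paths with a leading `/`; strip it before passing them to `tar_cmd`, since the archive members are relative to the mount point.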

Good luck. :-s

Karol Nowak
  • I think the freeze occurs every time those bad blocks are about to be accessed/read, so I think fsck.ext4 would also freeze when it reaches those blocks, right? – chakmeshma Aug 12 '18 at 11:49
  • Maybe. Do you get bad block information in `dmesg`? Perhaps you could use that on the next run to skip them. – Karol Nowak Aug 12 '18 at 11:51
  • The whole server freezes completely. I can't run any command when it happens unless I issue a hardware reset, and because it's a network boot, all data is volatile and gone after the reset. – chakmeshma Aug 12 '18 at 11:56
  • "Freezes completely" as in you lose network connection? Does it still ping? That's strange, but... do you have access to the kernel log in the admin panel? – Karol Nowak Aug 12 '18 at 11:59
  • Now I ran fsck and got this: "ext2fs_open2: Bad magic number in super-block" – chakmeshma Aug 12 '18 at 12:00
  • Yes, I lose the connection after the freeze. – chakmeshma Aug 12 '18 at 12:01
  • Any chance the filesystem gets mounted with `errors=panic`? Take a look at the `sb=n` mount option / `-b` `fsck` option. – Karol Nowak Aug 12 '18 at 12:03
  • The fstab looks normal, without any errors=panic. – chakmeshma Aug 12 '18 at 12:27