
I'm currently running a Debian 8.6 VM in a Hyper-V environment. The VM runs a number of cron tasks that copy files from a mounted machine to our NAS and then run md5 checksums on the copied files.

The problem is that once every couple of weeks the filesystem appears to become corrupted, and my cron jobs stop executing. If I try to edit the crontab via crontab -e, I get the following error:

/tmp/crontab.Vvp59T: Input/output error
Creation of temporary crontab file failed - aborting

I also notice certain commands fail to be recognized:

root@srv-schl-008:/home/ilienert# dmesg | lpr
bash: lpr: command not found

I then tried to run a file system check but nothing worked. Here's a log of what I did:

root@srv-schl-008:/home/ilienert# parted /dev/sda 'print'
Model: Msft Virtual Disk (scsi)
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system     Name  Flags
 1      1049kB  538MB   537MB   fat32                 boot, esp
 2      538MB   51.5GB  51.0GB  ext4
 3      51.5GB  53.7GB  2145MB  linux-swap(v1)
root@srv-schl-008:/home/ilienert# fsck /dev/sda2
fsck from util-linux 2.25.2
e2fsck 1.42.12 (29-Aug-2014)
/dev/sda2: recovering journal
fsck.ext4: Bad magic number in super-block while trying to re-open /dev/sda2
/dev/sda2: ********** WARNING: Filesystem still has errors **********
root@srv-schl-008:/home/ilienert# echo $? # status of last command
12
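
For reference, the exit status of fsck is a bitmask documented in the fsck man page, so 12 is the sum of 4 (filesystem errors left uncorrected) and 8 (operational error). A minimal bash sketch for decoding such a status follows; the function name describe_fsck_status is hypothetical and used only for illustration:

```shell
#!/bin/bash
# Decode an fsck exit status into its documented components.
# describe_fsck_status is a hypothetical helper name, not part of util-linux.
describe_fsck_status() {
    local status="$1"
    if [ "$status" -eq 0 ]; then
        echo "0: no errors"
        return 0
    fi
    # Each bit is reported independently, as listed in the fsck man page.
    (( status & 1 ))   && echo "1: filesystem errors corrected"
    (( status & 2 ))   && echo "2: system should be rebooted"
    (( status & 4 ))   && echo "4: filesystem errors left uncorrected"
    (( status & 8 ))   && echo "8: operational error"
    (( status & 16 ))  && echo "16: usage or syntax error"
    (( status & 32 ))  && echo "32: checking canceled by user request"
    (( status & 128 )) && echo "128: shared-library error"
    return 0
}

# Example: describe_fsck_status 12
```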

Finally, when we reboot the VM, we see this splash screen. At that point we are forced to restore the VM from an earlier state, after which it works fine for another couple of weeks before failing again. Any idea why this keeps happening?

Here are the specs of our environment:

Debian version: 8.6.0
Hyper-V is running on two clustered servers with Windows Server 2016 Datacenter
Hardware specs Hyper-V hosts: 2x E5-2650v2 8-Cores, 256GB DDR3 RAM
Backup solution: Veeam Backup & Replication 9.5
VM specs:
Clustered: yes
Generation: 2
Dynamic memory: 16834MB
Number of virtual processors: 4
Hard disk type: VHDX

EDIT: Here is a screenshot I dug up of a previous time when I was able to execute dmesg right after the failure: dmesg output

BGSGunterGlut

1 Answer


It is hard to say what is going wrong (something that breaks over time) from the information you posted alone. I am afraid you will have to investigate further. Here are two things that come to mind:

1) We cannot rule out that the physical hard drive hosting the virtual machine is starting to fail. I would recommend checking the disk on its own (not via a virtual machine) if possible.

2) After you restore your machine, check the main system logs every day; they may contain error messages that help explain what is going wrong. I normally run a few bash commands that check the main logs and append the results to a file, like this (you may not have all of these logs on your machine, and some may need packages you don't have installed, so feel free to adapt it for your own use):

echo '---------Dmesg-----------' >> /var/log/mylog.txt
dmesg --level=emerg,alert,crit,err >> /var/log/mylog.txt 2>&1
echo '---------systemctl-----------' >> /var/log/mylog.txt
systemctl --state=failed --all -q >> /var/log/mylog.txt 2>&1
echo '---------grep Xorg.0.log-----------' >> /var/log/mylog.txt
grep -wi 'error\|fail\|fault\|corruption\|hung\|lockup\|unknown\|segmentation\|critical\|missing\|(EE)' /var/log/Xorg.0.log >> /var/log/mylog.txt 2>&1
echo '---------journalctl - the last report-----------' >> /var/log/mylog.txt
journalctl -q --boot -0 --priority=3 >> /var/log/mylog.txt 2>&1
echo '---------grep var/log/boot.log -----------------' >> /var/log/mylog.txt
grep -wi 'error\|fail\|fault\|corruption\|hung\|lockup\|unknown\|segmentation\|critical\|missing' /var/log/boot.log >> /var/log/mylog.txt 2>&1
echo '---------end of mylog reports-----------' >> /var/log/mylog.txt
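
If you run these checks daily, for example from cron, it can help to timestamp each section and to skip tools that are not installed. Here is a rough sketch of one way to do that; the log_section helper and the LOG_FILE variable are names I made up for illustration, so adapt them as needed:

```shell
#!/bin/bash
# Hypothetical helper: append a timestamped section header and the output
# of one command to the log file. LOG_FILE defaults to the path used above.
LOG_FILE="${LOG_FILE:-/var/log/mylog.txt}"

log_section() {
    local title="$1"
    shift
    {
        printf -- '--------- %s (%s) ---------\n' "$title" "$(date '+%F %T')"
        if command -v "$1" >/dev/null 2>&1; then
            # Run the command with its arguments; stderr also goes to the log.
            "$@"
        else
            printf 'skipped: %s not installed\n' "$1"
        fi
    } >> "$LOG_FILE" 2>&1
}

# Example calls, reusing commands from the list above:
#   log_section 'Dmesg' dmesg --level=emerg,alert,crit,err
#   log_section 'systemctl' systemctl --state=failed --all -q
```

With timestamps in the headers, it is easier to see on which day the first error appeared.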

I would also recommend rebooting the machine every day until you find the problem.

I hope this helps!

Jamil Said
  • Thanks a lot. Please see my edit which contains some more relevant information. – BGSGunterGlut May 03 '17 at 08:59
  • I would still recommend the same, with some additional suggestions: restore the machine and run `fsck` immediately after the restore -- if it succeeds in correcting the problems (exit code 1, see the fsck man page), then I would make that state the new restore point for the machine, start using the machine again, and hope for the best. Again, I would be suspicious of this hard drive's health, and I would run an extensive test on it from a rescue system if the problems persist. If all this fails, check the logs every day and investigate each error you find there one by one. Good luck! – Jamil Said May 03 '17 at 19:39