I run my own small server on Ubuntu 18.04. It has a single HDD with one partition managed by LVM, formatted as EXT4. I use LVM for taking snapshots, and Webmin with Virtualmin for administration.
Over the past few weeks I have been facing some strange problems. I have run this server for many years and never had any serious data loss, except for a few rare cases where it was my own mistake.
A few weeks ago I tried to browse to one of my pages and encountered an error message like "the file system needs cleaning".
I googled it and ran e2fsck on my LVM volume. It found several errors and fixed them. Unfortunately, after these errors were fixed, one of the server's web directories was gone. Thanks to my backups I was able to restore all data.
The server was up and running again. Some weeks later, my WordPress instance was compromised through a bad plugin. I got the wp-tmp.php malware: https://stackoverflow.com/questions/52897669/what-can-do-virus-wp-tmp-php-on-wordpress
After detecting this breach, I changed all relevant passwords and moved the whole folder out of the web-accessible path. Because every web project is assigned to its own account on the server, I hope that this script (which served some JavaScript to visitors) was not able to do much damage.
One week later I noticed that another directory, belonging to a different user, was completely missing. I ran e2fsck again, and once more it reported missing or corrupted inodes that needed to be fixed.
Now I am asking myself the following questions:
- What can cause such a significant EXT4 data loss?
- Can it be related to the fact that I take an LVM snapshot every midnight and back up the snapshot to an external drive? (I have read about problems with LVM snapshots when the HDD write cache is enabled.)
- Are there any monitoring tools for this kind of behavior? I would like to be able to trace everything that happened before the files were lost or the EXT4 file system became corrupt. Is there anything like that?
Thank you!
---- Update: 05.10.2020 -------
Here is a syslog excerpt:
Oct 1 00:00:09 dtbsrv1 kernel: [565918.456000] EXT4-fs (dm-3): 9 orphan inodes deleted
Oct 1 00:00:09 dtbsrv1 kernel: [565918.456001] EXT4-fs (dm-3): recovery complete
Oct 1 00:00:09 dtbsrv1 kernel: [565918.743753] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
Oct 1 21:11:54 dtbsrv1 kernel: [642222.440081] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 21:11:54 dtbsrv1 kernel: [642222.440085] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 25: comm php-fpm7.2: Directory block failed checksum
Oct 1 21:11:54 dtbsrv1 kernel: [642222.686629] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 21:11:54 dtbsrv1 kernel: [642222.686631] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 25: comm php-fpm7.2: Directory block failed checksum
Oct 1 21:37:01 dtbsrv1 kernel: [643730.020412] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 21:37:01 dtbsrv1 kernel: [643730.020416] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 24: comm php-fpm7.2: Directory block failed checksum
Oct 1 21:37:02 dtbsrv1 kernel: [643730.244533] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 21:37:02 dtbsrv1 kernel: [643730.244537] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 24: comm php-fpm7.2: Directory block failed checksum
Oct 1 22:57:24 dtbsrv1 kernel: [648552.977881] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 22:57:24 dtbsrv1 kernel: [648552.977885] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 1297: comm php-fpm7.2: Directory block failed checksum
Oct 1 22:57:25 dtbsrv1 kernel: [648553.463297] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
These messages occurred without any apparent trigger.
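To get an overview of which devices and inodes are affected, the excerpt above can be summarized with a small script. This is only a minimal sketch: the function name `summarize_ext4_events` and the regular expression are my own assumptions, written to match the "EXT4-fs warning/error (device ...)" line format shown in this excerpt, and other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Assumed pattern, derived from the log lines in this post, e.g.:
#   EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: ...
EXT4_RE = re.compile(
    r"EXT4-fs (?P<level>warning|error) \(device (?P<dev>[^)]+)\): "
    r"(?P<func>\w+):\d+: inode #(?P<inode>\d+):"
)

def summarize_ext4_events(lines):
    """Count EXT4 warnings and errors per (device, inode, level)."""
    counts = Counter()
    for line in lines:
        match = EXT4_RE.search(line)
        if match:
            key = (match.group("dev"), int(match.group("inode")), match.group("level"))
            counts[key] += 1
    return counts
```

It could be fed an open log file, for example `summarize_ext4_events(open("/var/log/syslog", errors="replace"))`. Lines without an inode number, such as the "recovery complete" messages for dm-3, are deliberately ignored. A reported inode number can then be mapped back to a path with `find / -xdev -inum 19925296`.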
Here are the SMART results (smartctl attribute table):
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 071 061 006 Pre-fail Always - 72097400
3 Spin_Up_Time 0x0003 099 099 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 51
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 088 060 045 Pre-fail Always - 571428862
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 15825 (102 53 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 51
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 059 049 040 Old_age Always - 41 (Min/Max 39/41)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 1
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 33
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 95
194 Temperature_Celsius 0x0022 041 051 000 Old_age Always - 41 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 15824 (80 28 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 16720897520
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 17531397406
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
It's a Seagate HDD, so the raw values of Raw_Read_Error_Rate and Seek_Error_Rate use a special format. See here: https://forums.unraid.net/topic/31038-solved-seagate-with-huge-seek-error-rate-rma/
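As far as I understand the commonly cited community interpretation (this is an assumption, not something Seagate documents officially), these raw values are 48-bit numbers where the upper 16 bits hold the error count and the lower 32 bits hold the number of operations. A minimal sketch of that decoding, with the hypothetical helper `split_seagate_raw`, applied to the two values from the table above:

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (error_count, operation_count).

    Assumes the community-described layout: upper 16 bits are errors,
    lower 32 bits are the total number of operations.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations

# Raw values taken from the SMART table in this post.
for name, raw in [("Raw_Read_Error_Rate", 72097400), ("Seek_Error_Rate", 571428862)]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: {errors} errors in {operations} operations")
```

Both raw values are below 2^32, so under this interpretation the error count is 0 for each attribute, which would fit the zero values for Reallocated_Sector_Ct and Current_Pending_Sector.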
Here is the df -h output:
[srvadmin@dtbsrv1 ~]# df -h
df: /mnt/restic: Transport endpoint is not connected
Filesystem Size Used Avail Use% Mounted on
udev 3.9G 0 3.9G 0% /dev
tmpfs 786M 3.7M 782M 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 590G 264G 297G 48% /
tmpfs 3.9G 4.0K 3.9G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/loop0 97M 97M 0 100% /snap/core/9804
/dev/loop1 98M 98M 0 100% /snap/core/9993
/dev/sda2 976M 212M 698M 24% /boot
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/a404062f5a9eef43425c25238a9f4f82a144d94046ac9addace7e3c70c4934e4/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/f1a6f65efc0ef172471ff367da1a35a9d7debbcd75229653730a51b7fa30d38e/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/7c47723744bbdcfe8a2c809cdea9bb52f5fcb17ed22c81e37f37d205776c6237/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/24fa352fb9497e097754e41dfa22fce703a2067e5668bb692310e3485fa7e106/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/2c2355d6e3d1eabfb5b0db7a2a85c34b2a5a3056bcc8b574ec3eda2f55549c0b/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/53335cdd36121c2cb6a28a5bf6e287d6a24501b13eb901f361eeb49f5ff229cd/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/6f97bdc49212ebb61ea263c27fff544dfb9345bdc12e5f212796d5183e250368/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/07bb3f1692502da5d202c70abb48c9fbdc8388804170403302d607eacf44c8e1/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/14ee0758a510385a6ed6dd97c538fea275ff68d5c20c15cb1a4638c2e3b3b243/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/cd8f7f02d0aad720d5171cd013471d69fabda2b60bdbd0563c5db9f71f2e90cb/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/59aafbca3b59180e04d992eabfee852c6cbf6d68f9508418985a890a8af3ee62/merged
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/1c95f755df040e8be68bf556ae2edd06dbc88d4474b1e924adbe9c4572e49679/merged
shm 64M 0 64M 0% /var/lib/docker/containers/ba4fcf7feee3a2abccf70fb28cf938800df307e9a607574568a67f61bb0e29f8/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/12987bacd86368ec8b55cad121609e0e5495b2de98a189446ea835327708265b/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/32784c91bf4662a6b395faab020590a401e38dc7c02271fcfa983a0bcad3c9b5/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/8ff1625130226700ef069e228041e1322169fc2146d33b27e1593d81c3c08e6b/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/5ba9140321d72192a4ed4b888cee2859261c3dfb49c3e57f60c163f030425f5e/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/086463232e2bfe3af088bf123a8f6a9768558bbd9ae2498fbdfbd6f6a3e03894/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/5dbce8c1c3456e214d626b0c21be241d936f79153ddc00874350ec3446d904a5/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/abc27059eea71599c6a4068c6365c7aae156c69aaba7ed1fc8f44ec405715f60/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/edf45d83317d69ffc5909b0e3222d82f25bba6917d5851d573e47517975f3efc/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/e802192919f47fb359a726aa1590faf7cb7fbe6340111899ea623f00fdf05e62/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/eb4f624f4a0d3ab646d6c03cf5eafb9e2833554ac5ed6319c370babd7ea96957/mounts/shm
shm 64M 8.0K 64M 1% /var/lib/docker/containers/2fa8ffce0fe4580d6733e939beec08118a9e4bdbfd695485d7631c9e006b3ddc/mounts/shm
overlay 590G 264G 297G 48% /var/lib/docker/overlay2/835ed57e1c092dd1b8431618b4ae472f67cf2b867dd3bed987864cff4ddc87e7/merged
shm 64M 0 64M 0% /var/lib/docker/containers/b42dfff9ab9b971f175a3d0f8878731be1bea838f580f273aadef7c534f82b73/mounts/shm
Any ideas from your side?
Thanks!
---- Update 2: 05.10.2020 -------
It looks like someone had a similar issue: https://discourse.osmc.tv/t/ext4-dirent-csum-verify-no-space-for-directory-leaf-checksum/75772/15
In that case the cause was a defective SATA-to-USB adapter. In my case this would mean defective onboard SATA hardware. Could it be "that simple"?