
I run my own small server here. It runs Ubuntu 18.04 on a single HDD, with LVM on one partition and ext4 on top. LVM is used for taking snapshots. I also use Webmin with Virtualmin for administration.

During the past weeks I was faced with some strange problems. I have run this server for many years and never had any serious data loss, except for a few rare cases that were my own mistake.

A few weeks ago I tried to browse to one of my pages and encountered an error message like "the file system needs cleaning".

OK, I googled it and ran e2fsck on my LVM volume. It found several errors and fixed them. Unfortunately, after fixing these errors, one of the server's web directories was lost. Thanks to my backup concept I was able to restore all data.

The server was up and running again... Some weeks later, my WordPress instance was breached through a bad plugin. I got the wp-tmp.php malware: https://stackoverflow.com/questions/52897669/what-can-do-virus-wp-tmp-php-on-wordpress

After detecting this breach, I changed all relevant passwords and moved the whole folder out of reach of the web server... Since every web project is assigned to its own account on the server, I hope that this script (which served some JavaScript to visitors) was not able to do much damage...

One week later I noticed that another directory (belonging to another user) was completely missing. I ran e2fsck again; once more there were errors about missing or corrupted inodes that needed to be fixed.

Now I am asking myself the following questions:

  1. What can cause such significant ext4 data loss?
  2. Can it be related to the fact that I take LVM snapshots every midnight and back up the snapshot to an external drive? (I have read about problems with LVM snapshots when the HDD write cache is enabled.)
  3. Are there any monitoring tools for this kind of behavior? I would like to be able to trace everything that happened before the files were lost or the ext4 structure became corrupt... Is there anything like that?
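Regarding question 3, the ext4 driver already reports corruption to the kernel log, so a minimal monitoring sketch is a grep over it. The pattern below is an assumption based on the message prefixes ext4 actually prints; adjust it for your setup:

```shell
# Hedged sketch: watch the kernel log for ext4 warnings/errors.
pattern='EXT4-fs (error|warning)'

# Live monitoring on a systemd system (commented out; it runs forever):
#   journalctl -kf | grep -E "$pattern"

# Demonstration against a sample kernel line:
sample='kernel: EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: comm php-fpm7.2: Directory block failed checksum'
echo "$sample" | grep -E "$pattern"
```

Hooking such a pipeline into a cron job or a log watcher (e.g. logcheck) would have flagged the corruption the same night it started, rather than weeks later.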

Thank you!

---- Update 1: 05.10.2020 ----

Here is a syslog excerpt:

Oct  1 00:00:09 dtbsrv1 kernel: [565918.456000] EXT4-fs (dm-3): 9 orphan inodes deleted
Oct  1 00:00:09 dtbsrv1 kernel: [565918.456001] EXT4-fs (dm-3): recovery complete
Oct  1 00:00:09 dtbsrv1 kernel: [565918.743753] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
Oct  1 21:11:54 dtbsrv1 kernel: [642222.440081] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct  1 21:11:54 dtbsrv1 kernel: [642222.440085] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 25: comm php-fpm7.2: Directory block failed checksum
Oct  1 21:11:54 dtbsrv1 kernel: [642222.686629] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct  1 21:11:54 dtbsrv1 kernel: [642222.686631] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 25: comm php-fpm7.2: Directory block failed checksum
Oct  1 21:37:01 dtbsrv1 kernel: [643730.020412] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct  1 21:37:01 dtbsrv1 kernel: [643730.020416] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 24: comm php-fpm7.2: Directory block failed checksum
Oct  1 21:37:02 dtbsrv1 kernel: [643730.244533] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct  1 21:37:02 dtbsrv1 kernel: [643730.244537] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 24: comm php-fpm7.2: Directory block failed checksum
Oct  1 22:57:24 dtbsrv1 kernel: [648552.977881] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct  1 22:57:24 dtbsrv1 kernel: [648552.977885] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 1297: comm php-fpm7.2: Directory block failed checksum
Oct  1 22:57:25 dtbsrv1 kernel: [648553.463297] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.

These messages occurred without any special preceding condition.
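The kernel's own suggestion ("Please run e2fsck -D") rebuilds the directory indexes, but e2fsck must not run on a filesystem mounted read-write, and dm-0 is the root volume here. A hedged sketch, with the repair commands printed rather than executed because they must be run from rescue media (or via a forced boot-time check with the kernel parameters fsck.mode=force fsck.repair=yes):

```shell
# Printed, not executed: running e2fsck on a mounted root filesystem is unsafe.
cat <<'EOF'
e2fsck -f -D /dev/mapper/ubuntu--vg-ubuntu--lv   # full check + rebuild directory indexes
EOF

# Count how often the checksum errors recur, to see whether repairs hold:
grep -c 'EXT4-fs error' /var/log/syslog 2>/dev/null || true
```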

Here are the SMART results:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   071   061   006    Pre-fail  Always       -       72097400
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       51
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       571428862
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       15825 (102 53 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       51
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   049   040    Old_age   Always       -       41 (Min/Max 39/41)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       95
194 Temperature_Celsius     0x0022   041   051   000    Old_age   Always       -       41 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       15824 (80 28 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       16720897520
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       17531397406
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

It's a Seagate HDD, so Raw_Read_Error_Rate and Seek_Error_Rate use a vendor-specific raw-value format... See here: https://forums.unraid.net/topic/31038-solved-seagate-with-huge-seek-error-rate-rma/
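To illustrate: on many Seagate models the raw value of these attributes packs two counters into one number, the real error count in the upper bits and the total operation count in the lower 32 bits. A sketch decoding the Seek_Error_Rate raw value above (the bit layout is an assumption that holds for many, but not all, Seagate drives):

```shell
# Hedged sketch: split a Seagate-style raw value into error count (upper bits)
# and total operations (lower 32 bits). The layout varies by model.
raw=571428862                      # Seek_Error_Rate raw value from above
errors=$(( raw >> 32 ))            # upper bits: actual seek errors
seeks=$(( raw & 0xFFFFFFFF ))      # lower 32 bits: total seeks performed
echo "errors=$errors seeks=$seeks" # → errors=0 seeks=571428862
```

Decoded this way, the scary-looking number means zero seek errors over ~571 million seeks, i.e. nothing alarming.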

Here is the df -h output:

[srvadmin@dtbsrv1 ~]# df -h
df: /mnt/restic: Transport endpoint is not connected
Filesystem                         Size  Used Avail Use% Mounted on
udev                               3.9G     0  3.9G   0% /dev
tmpfs                              786M  3.7M  782M   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv  590G  264G  297G  48% /
tmpfs                              3.9G  4.0K  3.9G   1% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
tmpfs                              3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/loop0                          97M   97M     0 100% /snap/core/9804
/dev/loop1                          98M   98M     0 100% /snap/core/9993
/dev/sda2                          976M  212M  698M  24% /boot
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/a404062f5a9eef43425c25238a9f4f82a144d94046ac9addace7e3c70c4934e4/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/f1a6f65efc0ef172471ff367da1a35a9d7debbcd75229653730a51b7fa30d38e/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/7c47723744bbdcfe8a2c809cdea9bb52f5fcb17ed22c81e37f37d205776c6237/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/24fa352fb9497e097754e41dfa22fce703a2067e5668bb692310e3485fa7e106/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/2c2355d6e3d1eabfb5b0db7a2a85c34b2a5a3056bcc8b574ec3eda2f55549c0b/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/53335cdd36121c2cb6a28a5bf6e287d6a24501b13eb901f361eeb49f5ff229cd/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/6f97bdc49212ebb61ea263c27fff544dfb9345bdc12e5f212796d5183e250368/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/07bb3f1692502da5d202c70abb48c9fbdc8388804170403302d607eacf44c8e1/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/14ee0758a510385a6ed6dd97c538fea275ff68d5c20c15cb1a4638c2e3b3b243/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/cd8f7f02d0aad720d5171cd013471d69fabda2b60bdbd0563c5db9f71f2e90cb/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/59aafbca3b59180e04d992eabfee852c6cbf6d68f9508418985a890a8af3ee62/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/1c95f755df040e8be68bf556ae2edd06dbc88d4474b1e924adbe9c4572e49679/merged
shm                                 64M     0   64M   0% /var/lib/docker/containers/ba4fcf7feee3a2abccf70fb28cf938800df307e9a607574568a67f61bb0e29f8/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/12987bacd86368ec8b55cad121609e0e5495b2de98a189446ea835327708265b/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/32784c91bf4662a6b395faab020590a401e38dc7c02271fcfa983a0bcad3c9b5/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/8ff1625130226700ef069e228041e1322169fc2146d33b27e1593d81c3c08e6b/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/5ba9140321d72192a4ed4b888cee2859261c3dfb49c3e57f60c163f030425f5e/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/086463232e2bfe3af088bf123a8f6a9768558bbd9ae2498fbdfbd6f6a3e03894/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/5dbce8c1c3456e214d626b0c21be241d936f79153ddc00874350ec3446d904a5/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/abc27059eea71599c6a4068c6365c7aae156c69aaba7ed1fc8f44ec405715f60/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/edf45d83317d69ffc5909b0e3222d82f25bba6917d5851d573e47517975f3efc/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/e802192919f47fb359a726aa1590faf7cb7fbe6340111899ea623f00fdf05e62/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/eb4f624f4a0d3ab646d6c03cf5eafb9e2833554ac5ed6319c370babd7ea96957/mounts/shm
shm                                 64M  8.0K   64M   1% /var/lib/docker/containers/2fa8ffce0fe4580d6733e939beec08118a9e4bdbfd695485d7631c9e006b3ddc/mounts/shm
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/835ed57e1c092dd1b8431618b4ae472f67cf2b867dd3bed987864cff4ddc87e7/merged
shm                                 64M     0   64M   0% /var/lib/docker/containers/b42dfff9ab9b971f175a3d0f8878731be1bea838f580f273aadef7c534f82b73/mounts/shm

Any ideas from your side?

Thanks!

---- Update 2: 05.10.2020 ----

It looks like someone had a similar issue: https://discourse.osmc.tv/t/ext4-dirent-csum-verify-no-space-for-directory-leaf-checksum/75772/15

The solution there was a defective SATA-to-USB adapter. In my case this would mean defective onboard SATA hardware. Could it be "that simple"?

C. Hediger

2 Answers


Possibly you have multiple independent problems at once: a compromise of your applications, failing storage, or something else.

Carefully read How do I deal with a compromised server? Hope is not a plan. In general, the only way to be sure malware is gone is to completely wipe the system and reinstall the OS and applications from known-good copies. For example, re-download good copies of all your WordPress plugins. And change all passwords and other credentials. Be very sure you know the root cause and the extent of the infection before settling for less.

Regarding storage, check the disk's health attributes, such as with smartctl. On any indication of serious wear, replace it. Even on a single non-array disk, LVM allows you to migrate to a new disk with pvmove. A widely deployed file system like ext4 is well tested, but it relies on the storage hardware beneath it, which will eventually fail. Alternatively, it is possible that malware altered or deleted data, as the extent of that infection does not sound like it was conclusively established.
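The pvmove migration can be done online, with the filesystem in use. A hedged sketch of the typical sequence, printed rather than executed because it rewrites volume metadata; the device names /dev/sdb1 and /dev/sda3 and the volume group name ubuntu-vg are assumptions (check yours with pvs and vgs first):

```shell
# Printed, not executed: run these deliberately, after verifying names.
cat <<'EOF'
pvcreate /dev/sdb1              # prepare the replacement disk
vgextend ubuntu-vg /dev/sdb1    # add it to the existing volume group
pvmove /dev/sda3 /dev/sdb1      # migrate all extents off the old PV, online
vgreduce ubuntu-vg /dev/sda3    # finally remove the old PV from the group
EOF
```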

Review your backup copies and check whether you can see the state before and after these events. Read the logs to see if the kernel printed anything interesting about storage to syslog or the journal. Possibly this won't prove anything conclusively; a lot happens to files that just doesn't get logged or included in backups.

Should you desire better security and integrity tooling, you'll have to research it yourself. There are whole categories of file integrity monitoring software, which work either by auditing changes or by verifying the integrity of files. WordPress has its own specialty security software and professional consulting, if that is something you wish to purchase.
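As one concrete example of the "auditing changes" category: the Linux audit subsystem can record every write under a directory tree. A sketch, printed rather than executed since auditctl needs root and a running auditd; /var/www is a guessed web root:

```shell
# Printed, not executed: requires root and the auditd daemon.
cat <<'EOF'
auditctl -w /var/www -p wa -k web-writes   # watch writes and attribute changes
ausearch -k web-writes                     # later: review the recorded events
EOF
```

With a rule like this in place, a disappearing directory would leave a trail showing which process (php-fpm, a backup job, the kernel) deleted it.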

John Mahowald
  • Thanks for your answer. I have updated my original post in the meantime. It looks like there is some serious ext4 problem... – C. Hediger Oct 05 '20 at 09:40

I have found the solution:

I figured out that a specific user on the LVM logical volume had reached their quota limit. This was the root cause of all these problems... Every fsck was unsuccessful, or only helped for a few hours, because the quota filled up again...

It looks like reaching the quota limit can break the ext4 structure...
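For anyone hitting the same symptom, a sketch of checking quota headroom before it becomes a problem (standard Linux quota tools assumed; the percentage calculation below uses made-up example numbers, not values from this server):

```shell
# Show usage vs. limits per user on / (human-readable). Harmless if quotas
# are not enabled on the filesystem, hence the || true.
repquota -s / 2>/dev/null || true

# How close a user is to a soft limit (illustrative numbers only):
used_kb=980000
soft_kb=1048576
echo "quota $(( used_kb * 100 / soft_kb ))% used"   # → quota 93% used
```

A cron job that alerts when this percentage crosses, say, 90% would catch the condition before any filesystem damage occurs.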

Good to know for future problems...

C. Hediger