
We have a Supermicro server with the following configuration:

CentOS 7 with software RAID (this Supermicro configuration doesn't support hardware RAID). The / partition is on RAID 10 and the rest is on RAID 1.

CPU: 2x AMD EPYC 7402; RAM: 512 GB DDR4 (16x 32 GB); Storage: 10x 2 TB Intel SSD DC P4510 NVMe.

This server is a shared hosting one with CloudLinux, cPanel, etc.

Every 2 days, we receive the following error in the console:

Oct 18 23:11:19 toranaga kernel: XFS (snumbd4d): log I/O error -5

Oct 18 23:11:19 toranaga kernel: XFS (snumbd4d): Log I/O Error Detected. Shutting down filesystem

Oct 18 23:11:20 toranaga kernel: XFS (snumbd3d): log I/O error -5

Oct 18 23:11:20 toranaga kernel: XFS (snumbd3d): Log I/O Error Detected. Shutting down filesystem

Oct 18 23:11:20 toranaga kernel: XFS (snumbd1d): log I/O error -5

Oct 18 23:11:20 toranaga kernel: XFS (snumbd1d): Log I/O Error Detected. Shutting down filesystem

Oct 20 16:01:54 toranaga kernel: XFS (snumbd8d): log I/O error -5

Oct 20 16:01:54 toranaga kernel: XFS (snumbd8d): Log I/O Error Detected. Shutting down filesystem

Oct 20 16:01:54 toranaga kernel: XFS (snumbd2d): log I/O error -5

Oct 20 16:01:54 toranaga kernel: XFS (snumbd2d): Log I/O Error Detected. Shutting down filesystem

Oct 20 16:02:02 toranaga kernel: XFS (snumbd6d): metadata I/O error in "xfs_read_agf+0x8e/0x120 [xfs]" at daddr 0x423e1d801 len 1 error 5

Oct 20 16:02:02 toranaga kernel: XFS (snumbd6d): log I/O error -5

Oct 20 16:02:02 toranaga kernel: XFS (snumbd6d): Log I/O Error Detected. Shutting down filesystem

Oct 20 16:02:05 toranaga kernel: XFS (snumbd7d): log I/O error -5

Oct 20 16:02:05 toranaga kernel: XFS (snumbd7d): Log I/O Error Detected. Shutting down filesystem
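To see at a glance which filesystems shut down in each incident, the messages above can be grouped by day. This is only a minimal sketch: it assumes the syslog line format shown in this post, and the function name `shutdowns_by_day` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches kernel lines in the format shown above, e.g.
#   Oct 18 23:11:19 toranaga kernel: XFS (snumbd4d): log I/O error -5
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"XFS \((?P<dev>[^)]+)\): (?P<msg>.*)$"
)


def shutdowns_by_day(lines):
    """Map 'Mon DD' to the sorted devices whose filesystem shut down that day."""
    result = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "Shutting down filesystem" in match.group("msg"):
            day = " ".join(match.group("stamp").split()[:2])
            result[day].add(match.group("dev"))
    return {day: sorted(devs) for day, devs in result.items()}


# A few lines copied from the log above.
sample = [
    "Oct 18 23:11:19 toranaga kernel: XFS (snumbd4d): log I/O error -5",
    "Oct 18 23:11:19 toranaga kernel: XFS (snumbd4d): Log I/O Error Detected. Shutting down filesystem",
    "Oct 18 23:11:20 toranaga kernel: XFS (snumbd3d): Log I/O Error Detected. Shutting down filesystem",
    "Oct 20 16:01:54 toranaga kernel: XFS (snumbd8d): Log I/O Error Detected. Shutting down filesystem",
]
print(shutdowns_by_day(sample))
```

Feeding it the full output of `grep XFS /var/log/messages` (assuming that is where the kernel messages end up on this host) gives one entry per incident day.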

Any advice on what we should do? Thanks!

  • Does the device node for the XFS volume, i.e. /dev/mdXAY, vanish when you get these errors? If an XFS filesystem is more than 85% full, the volume has trouble committing the log metadata, so the filesystem log fills up with commits (XFS, ZFS, ext4 and Btrfs are all logging filesystems, whereas ext2 is not). That in turn causes the kernel cache buffers to fail to commit. So, you've g – Brian Sep 30 '22 at 15:21

1 Answer


Does the device node for the XFS volume, i.e. /dev/mdXAY, vanish when you get these errors? If an XFS filesystem is more than 85% full, the volume has trouble committing the log metadata, so the filesystem log fills up with commits (XFS, ZFS, ext4 and Btrfs are all logging filesystems, whereas ext2 is not). That in turn causes the kernel cache buffers to fail to commit, so you've got a load of uncommitted metadata that has nowhere to go. The XFS filesystem will then shut down and remove the device node, to protect the metadata in the log from loss.
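Both checks mentioned above (is the device node still there, and is the filesystem past roughly 85% full) are easy to script. A rough sketch using only the Python standard library; the 85% threshold is the figure from this answer, not an XFS constant, and the function names and the commented-out paths are placeholders.

```python
import os
import shutil
import stat


def device_node_present(device):
    """Return True if `device` exists and is a block device node."""
    try:
        return stat.S_ISBLK(os.stat(device).st_mode)
    except FileNotFoundError:
        return False


def usage_percent(mountpoint):
    """Percentage of the filesystem at `mountpoint` that is in use."""
    usage = shutil.disk_usage(mountpoint)
    return 100.0 * usage.used / usage.total


def over_threshold(mountpoint, threshold=85.0):
    """True if usage exceeds `threshold` percent (85 is this answer's figure)."""
    return usage_percent(mountpoint) > threshold


# Example calls with placeholder paths; substitute your own device and mount point:
# print(device_node_present("/dev/md0"))
# print(usage_percent("/home"), over_threshold("/home"))
```

Note that `shutil.disk_usage` computes usage slightly differently from the Use% column of `df`, so the two figures can differ by a few percent.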

Presently, I am working on moving the internal xfs log that resides inside the volume, to a large external file, external logs being a feature of xfs. I just haven't figured out if the log can be moved after creation of the filesystem.
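For reference, an external log is normally chosen when the filesystem is created, with the `-l logdev=` option of `mkfs.xfs`, and the same log device then has to be named again at mount time with `-o logdev=`. The sketch below only builds and prints those two commands and runs nothing. The device names, mount point and log size are hypothetical, and it says nothing about whether an existing internal log can be relocated.

```python
def external_log_commands(data_dev, log_dev, mountpoint, log_size="512m"):
    """Build (mkfs, mount) argument lists for an XFS filesystem with an external log.

    All arguments are placeholders. mkfs.xfs creates a NEW filesystem and
    destroys whatever is on data_dev, so this is for a fresh volume only.
    """
    mkfs = ["mkfs.xfs", "-l", f"logdev={log_dev},size={log_size}", data_dev]
    mount = ["mount", "-o", f"logdev={log_dev}", data_dev, mountpoint]
    return mkfs, mount


# Hypothetical devices, for illustration only.
mkfs_cmd, mount_cmd = external_log_commands("/dev/md1", "/dev/md2", "/mnt/mirror")
print(" ".join(mkfs_cmd))
print(" ".join(mount_cmd))
```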

I made some progress by unmounting and remounting, then, if that succeeded, unmounting again and running xfs_repair -L device to zero the log. That worked, and I finished the rsync transfer I was working on (a local Debian mirror update).
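The sequence I used can be written down as a small dry-run script. It is a sketch, not a recommendation: the device and mount point are placeholders, the final remount is my addition, and `xfs_repair -L` discards whatever is still in the log, so metadata changes that were never replayed are lost. By default the script only prints the commands.

```python
import subprocess


def recovery_commands(device, mountpoint):
    """Commands for the unmount / remount / zero-log sequence described above."""
    return [
        ["umount", mountpoint],
        ["mount", device, mountpoint],   # mounting replays the log if it can
        ["umount", mountpoint],
        ["xfs_repair", "-L", device],    # zeroes the log; destructive
        ["mount", device, mountpoint],   # remount after the repair
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False.

    check=True aborts at the first failing step, which matches
    "if successful to that point" in the description above.
    """
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder device and mount point; nothing is executed in dry-run mode.
commands = recovery_commands("/dev/md0", "/mnt/mirror")
run(commands, dry_run=True)
```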

But the problem recurred on the next update, and far too much space is in use on the drive, about 400 GB more than there should be. I'm going to try to safely zero the logs again and trim the NVMe drive. If that makes no difference to the unused space, I'll try to relocate the log.

Interestingly enough, after successfully mounting, unmounting, zeroing the log and remounting, running sync crashes the filesystem in the same way after a few minutes. That suggests the cache buffers are loaded with commits.

Brian