I'm working on a CentOS 7 machine with a 3.10 kernel and Docker 19.03.12.

Eventually, one of the Docker images filled up and took the entire /var mount to 100% usage, crashing both the Docker service and the running containers.

Now there are two zombie processes left that I can't kill (with kill -9 or killall):

ps axjf | grep docker
    1 30215 30215 30215 ?           -1 Ds       0   0:00 [docker-entrypoi]
    1 32063 32063 32063 ?           -1 Zsl      0   0:00 [dockerd] <defunct>

Meanwhile, in /var/log/messages I'm getting:

kernel: XFS (dm-8): Failing async write on buffer block 0xb78170. Retrying async write.
kernel: XFS (dm-8): metadata I/O error: block 0xb78170 ("xfs_buf_iodone_callback_error") error 28 numblks 8

It seems that some I/O is still trying to write data. This repeats in an endless cycle, and I'm not sure how to stop it.

du -sh and ls -al hang quickly when inspecting the files under /var/lib/docker.

Additionally, service docker stop/start also hangs, and top reports very high load and wait times (around 23 on a 4-core machine).

My question: without rebooting the machine, what would be the best way to cleanly stop the XFS writes, kill the zombie processes, and restart the services?

runr

1 Answer

Free up some disk space.

The error reported by the kernel messages you posted is 28, "No space left on device".
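
To check that mapping yourself, here is a minimal sketch using Python's standard `errno` and `os` modules. It assumes a Linux host, where error number 28 is `ENOSPC`; the variable names are only illustrative.

```python
import errno
import os

# Error number taken from the kernel log line ("error 28").
code = 28

# Symbolic name of the error number on this platform.
name = errno.errorcode[code]

# Human-readable description as reported by the C library.
message = os.strerror(code)

print(f"{code} -> {name}: {message}")
```

The same lookup works for any other error number that shows up in kernel or application logs.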

Michael Hampton
  • This didn't change after freeing up space on /var. My guess is that the docker service doesn't expand the virtual drive, which it can't do because it is stuck? And I can't seem to restart the service due to unresponsive systemctl calls. – runr Aug 27 '20 at 16:49
  • @Nutle You didn't provide any information about how you set up Docker storage, nor about your XFS filesystem. – Michael Hampton Aug 27 '20 at 17:09
  • I don't know the details, since I'm not the one who set it up. This is a standard general purpose machine, with ``docker-ce`` installed with everything by default. What additional info should I provide? The guess about the docker drive comes from the observation that ``ls -al`` and ``du -sh`` freeze when looking at the ``/var/lib/docker/.../devicemapper`` folder; my guess is that this is where the container data is stored. – runr Aug 27 '20 at 17:15
  • @Nutle Devicemapper? That's not a good sign. That was never a production quality storage driver and was never recommended for anything but development or testing, and it isn't very good at those either. I suspect you're going to be rebuilding this box before it's over. Anyway, you could provide information about the Docker storage configuration, filesystems present on the machine (e.g. df), xfs_info, and anything else that might be helpful. – Michael Hampton Aug 27 '20 at 17:33
  • Thanks for the info. I'm not an expert here, but I recall hearing that CentOS 7 overall is too old (kernel-wise) for proper volume support, among other things, and that it would be better to move to Debian or similar for docker. Maybe this is a related issue. If this is going to result in a rebuild, maybe it's also best to consider moving to Debian (or some other distribution with a later kernel) that would be more suitable for production? – runr Aug 27 '20 at 17:39
  • @Nutle CentOS 7 is perfectly fine. All the relevant kernel bits have been backported. (And I'd worry about your sanity if you went to Debian for anything.) You should provide the information I requested. – Michael Hampton Aug 27 '20 at 17:41
  • Rebooting the machine seems to have fixed it (xfs and docker were able to recover), though it took at least 4 hours to boot up again for some reason. This seems like an issue that could easily come up again. Instead of a dirty script to monitor and clean the space, is there an alternative configuration for docker's filesystem? You mentioned the use of ``devicemapper``; how would you advise changing it? I understand that this is beyond the scope of this post, but any hint in the right direction would really help. Will accept your answer since technically it was the main issue. – runr Aug 28 '20 at 11:05
  • From the ``xfs_info`` output for the ``/var`` mount, it seems that it is set to ``ftype=0``, i.e., ``naming =version 2 bsize=4096 ascii-ci=0 ftype=0``. Does this imply that the ``overlay2`` setup wouldn't work, [as per this discussion](https://bugzilla.redhat.com/show_bug.cgi?id=1475625)? – runr Aug 28 '20 at 11:36
  • @Nutle Yes, overlay2 is what you should be using, and you should also have a long chat with whoever chose the non-default ftype=0 on your XFS filesystem, because you'll have to reformat that too. – Michael Hampton Aug 28 '20 at 18:08
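
The ftype check discussed in the comments above can be sketched in a few lines of Python. This is only an illustration: the helper names `xfs_ftype` and `overlay2_supported` are hypothetical, the sample string is the `naming` line quoted in the comment, and the rule it encodes (overlay2 on XFS needs `ftype=1`) is the requirement referred to in the linked discussion.

```python
import re
from typing import Optional


def xfs_ftype(xfs_info_output: str) -> Optional[int]:
    """Return the ftype value found in xfs_info output, or None if absent."""
    match = re.search(r"\bftype=(\d+)", xfs_info_output)
    return int(match.group(1)) if match else None


def overlay2_supported(xfs_info_output: str) -> bool:
    """overlay2 on an XFS backing filesystem needs ftype=1 (d_type support)."""
    return xfs_ftype(xfs_info_output) == 1


# The "naming" line quoted in the comment above.
sample = "naming =version 2 bsize=4096 ascii-ci=0 ftype=0"

print("ftype:", xfs_ftype(sample))
print("overlay2 usable:", overlay2_supported(sample))
```

In practice you would feed the function the real output of `xfs_info` for the filesystem that holds `/var/lib/docker`. A filesystem reporting `ftype=0` has to be recreated with `ftype=1` before the storage driver is switched to overlay2.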