
I am facing a problem with Kubernetes nodes deployed on AWS (a cluster with 3 nodes and 1 master, running on m3.large instances with about 25 GB of disk each).

After about 3 days there are 0 KB left on disk and the cluster gets stuck.

More or less all of the storage is used by /var/lib/docker/overlay/. Inside this folder there are 500 or more directories like these:

drwx------ 3 root root 4096 Jun 20 15:33 ed4f90bd7a64806f9917e995a02974ac69883a06933033ffd5049dd31c13427a
drwx------ 3 root root 4096 Jun 20 15:28 ee9344fea422c38d71fdd2793ed517c5693b0c8a890964e6932befa1ebe5aa63
drwx------ 3 root root 4096 Jun 20 16:17 efed310a549243e730e9796e558b2ae282e07ea3ce0840a50c0917a435893d42
drwx------ 3 root root 4096 Jun 20 14:39 eff7f04f17c0f96cff496734fdc1903758af1dfdcd46011f6c3362c73c6086c2
drwx------ 3 root root 4096 Jun 20 15:29 f5bfb696f5a6cad888f7042d01bfe146c0621517c124d58d76e77683efa1034e
drwx------ 3 root root 4096 Jun 20 15:26 f5fa9d5d2066c7fc1c8f80970669634886dcaccc9e73ada33c7c250845d2fe8c
drwx------ 3 root root 4096 Jun 20 14:38 f8fd64fb1e0ab26708d5458dddd2d5a70018034237dfed3db48ada5666fcf77f
drwx------ 3 root root 4096 Jun 20 14:46 faa143ebd7a4079eaa45ddbf17dcfc9163e3035983f2e334e32a60e89452fa94
drwx------ 3 root root 4096 Jun 20 14:48 fb93c0c64e0d4935bf67fc8f70df2b8a4cffe59e294ee8a876dfdf6b57486da5
drwx------ 3 root root 4096 Jun 20 14:46 fd0a420d5655fb7d022c397effdb95968ff7e722c58fcc7915f97e8df47cd080

The cluster runs on Kubernetes 1.6.4 and Docker 1.12.6.

This seems to be a problem with the Kubernetes garbage collector. Running cAdvisor's /validate endpoint gives me the following message:

 None of the devices support 'cfq' I/O scheduler. No disk stats can be reported.
     Disk "xvda" Scheduler type "none".

Running `journalctl -u kubelet | grep -i garbage` also gives an error message:

     Jun 20 14:35:21 ip-172-21-4-239 kubelet[1551]: E0620 14:35:21.986898 1551 kubelet.go:1165] Image garbage collection failed: unable to find data for container /
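
In the meantime I reclaim space by hand with something like the following (a rough workaround sketch, assuming the exited containers and dangling images on the node are actually safe to remove):

    # remove all exited containers
    docker rm $(docker ps -aq --filter status=exited)

    # remove dangling (untagged) images
    docker rmi $(docker images -q --filter dangling=true)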

Any ideas how to get the garbage collector working again?

nrhode
  • They are directories, what's the size of each one? `du -sh /var/lib/docker/overlay/*` and the dir itself? `du -sh /var/lib/docker/overlay/` – Robert Jun 22 '17 at 12:13
  • The file sizes are not always the same. (There are a lot of files of just a few kilobytes, but there are also plenty of files of 710 MB, 197 MB, 17 MB and so on, 14 GB in sum.) The folder is 14 GB as well. – nrhode Jun 23 '17 at 05:41
  • Are you running kops? And if so, which version? Some of the older versions of kops don't have all the GC settings configured, so you may just need to upgrade kops. If you let me know which version of kops you're running I can confirm whether there are fixes in later versions. – justinsb Jul 06 '17 at 16:54

1 Answer


I was able to resolve a similar issue, with recurring high I/O on the nodes caused by `du -s /var/lib/docker/overlay/`, by editing the kops cluster spec with `kops edit cluster [cluster_name]`. I added the following under the cluster `spec`:

docker:
    logDriver: json-file
    logLevel: warn
    storage: overlay2

It looks like by default kops configures Docker to use overlay as the storage driver, while Docker recommends overlay2 as the newer, more stable, and faster option.
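
For completeness: after editing the spec, the change still has to be pushed out and the nodes replaced, roughly like this (the cluster name is a placeholder; note that switching the storage driver abandons the existing /var/lib/docker/overlay data, so nodes will re-pull their images):

    # apply the new spec and roll the nodes so Docker restarts with overlay2
    kops update cluster [cluster_name] --yes
    kops rolling-update cluster [cluster_name] --yes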

Mark Hilton