We're running a number of GitLab instances on a Kubernetes cluster. For persistence we use NFS; specifically, one NFS export is shared by all GitLab instances. Currently we have it mounted on all cluster nodes and bind-mount it into the pods, but we also tried a single NFS volume with subPath mounts per instance (sketch below). The issue we're having is the same either way.
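Roughly, from the node's point of view, both variants boil down to something like this (server name and paths are made-up examples; as far as I understand, the subPath variant also ends up as a bind mount on the node, done by the kubelet):

# shared NFS export, mounted once per node
mount -t nfs4 -o rw,vers=4.1,hard,timeo=600,retrans=2 nfsserver:/export/gitlab /mnt/gitlab-shared

# each GitLab instance then gets one subdirectory bind-mounted into its pod
mount --bind /mnt/gitlab-shared/gitlab-01 /mnt/gitlab-01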
What we see happening is this: every now and then (no discernible pattern), some of the GitLab instances start to hang (mostly those running on one specific node). These instances cannot be started on another node until the node on which they originally started to hang is rebooted.
The problematic node then shows very high load and a flood of NFS RPC requests, mostly for two methods: "OpenNoattr" and "TestStateId". The total number of RPC requests from this node goes up by about 20x and never goes down until the machine is rebooted.
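For reference, the per-operation counters can also be read on the affected node itself; something like the following should show them (the operation names in nfsstat are abbreviated, and as far as I can tell open_noat / test_stateid correspond to the two methods above):

# NFSv4 client operation counters on the affected node
nfsstat -c -4 -l | egrep -i 'open_noat|test_stateid'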
We tried tweaking the mount options, but it had no apparent impact on the problem. We're currently using:
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
nfsstat reads:
Server rpc stats:
calls        badcalls     badclnt      badauth      xdrcall
166383963    19392        0            19392        0

Server nfs v4:
null          compound
12         0% 166383895 99%
It looks like we're somehow accumulating stale locks that don't expire until the machine is rebooted. How could this happen? This all runs inside a VMware cluster, so "bad switch" is unlikely.
Also, I couldn't figure out the meaning of "badauth" or of the two RPC methods. Can someone enlighten me?
---- EDIT: some details -----
uname -a
Linux [all machines] 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
rpm -qa |grep nfs-utils
nfs-utils-1.3.0-0.61.el7.x86_64
We have since come to the conclusion that it is most probably related to the PostgreSQL running inside the container. This is mentioned in an advisory from GitLab, which recommends running
sysctl -w fs.leases-enable=0
on the NFS server side as a workaround. We have done this.
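For completeness, to make this survive a reboot of the NFS server it would also need to be persisted via sysctl.d, e.g. (file name is arbitrary):

# persist the lease-disable workaround on the NFS server
echo 'fs.leases-enable = 0' > /etc/sysctl.d/90-nfs-leases.conf
sysctl -p /etc/sysctl.d/90-nfs-leases.conf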