We're running a number of GitLab instances on a Kubernetes cluster. For persistence we use NFS; specifically, one NFS export is shared by all GitLab instances. Currently we have it mounted on all cluster nodes and bind-mount it into the pods, but we also tried a single NFS volume with subPath mounts per instance (sketch below). The issue we're having is the same either way.
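Roughly, from the node's point of view, both variants boil down to something like this (server name and paths are made-up examples; as far as I understand, the subPath variant also ends up as a bind mount on the node, done by the kubelet):

# shared NFS export, mounted once per node
mount -t nfs4 -o rw,vers=4.1,hard,timeo=600,retrans=2 nfsserver:/export/gitlab /mnt/gitlab-shared

# each GitLab instance then gets one subdirectory bind-mounted into its pod
mount --bind /mnt/gitlab-shared/gitlab-01 /mnt/gitlab-01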
What we see happening is this: every now and then (no discernible pattern), some of the GitLab instances start to hang (mostly those running on one specific node). These instances cannot be started on another node until the node on which they originally started to hang is rebooted.
The problematic node then shows very high load and a flood of NFS RPC requests, mostly for two methods: "OpenNoattr" and "TestStateId". The total number of RPC requests from this node goes up by about 20x and never goes down until the machine is rebooted.
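For reference, the per-operation counters can also be read on the affected node itself; something like the following should show them (the operation names in nfsstat are abbreviated, and as far as I can tell open_noat / test_stateid correspond to the two methods above):

# NFSv4 client operation counters on the affected node
nfsstat -c -4 -l | egrep -i 'open_noat|test_stateid'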
We tried tweaking the mount options, but it had no apparent impact on the problem. We're currently using:
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
nfsstat reads:
Server rpc stats:
calls        badcalls     badclnt      badauth      xdrcall
166383963    19392        0            19392        0

Server nfs v4:
null          compound
12         0% 166383895 99%
It looks like we're somehow accumulating stale locks that don't expire until the machine is rebooted. How could this happen? This all runs inside a VMware cluster, so "bad switch" is unlikely.
Also, I couldn't figure out the meaning of "badauth" or of the two RPC methods. Can someone enlighten me?
---- EDIT: some details -----
uname -a
Linux [all machines] 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
rpm -qa |grep nfs-utils
nfs-utils-1.3.0-0.61.el7.x86_64
We have since come to the conclusion that it is most probably related to the PostgreSQL running inside the container. This is mentioned in an advisory from GitLab, which recommends running
sysctl -w fs.leases-enable=0
on the NFS server side as a workaround. We have done this.
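For completeness, to make this survive a reboot of the NFS server it would also need to be persisted via sysctl.d, e.g. (file name is arbitrary):

# persist the lease-disable workaround on the NFS server
echo 'fs.leases-enable = 0' > /etc/sysctl.d/90-nfs-leases.conf
sysctl -p /etc/sysctl.d/90-nfs-leases.conf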