I run a 3-node GlusterFS 3.10 cluster with Heketi to automatically provision and deprovision storage from Kubernetes. Currently there are 20 active volumes, most with the minimum allowed size of 10 GB, but each holding only a few hundred MB of persisted data. Each volume is replicated across two nodes (the equivalent of RAID-1).
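The replication layout can be confirmed per volume with something like the following (where <volumeId> is a placeholder for one of the Heketi-generated volume names):

# gluster volume info <volumeId> | grep -E 'Type|Number of Bricks'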
However, the gluster processes consume a huge amount of memory (~13 GB on each node). After creating a statedump and looking at the result, each volume only accounts for between 1 and 30 MB of memory:
# for i in $(gluster volume list); do gluster volume statedump $i nfs; done
# grep mallinfo_uordblks -hn *.dump.*
11:mallinfo_uordblks=1959056
11:mallinfo_uordblks=20888896
11:mallinfo_uordblks=2793760
11:mallinfo_uordblks=23316944
11:mallinfo_uordblks=1917536
11:mallinfo_uordblks=29287872
11:mallinfo_uordblks=14807280
11:mallinfo_uordblks=2170592
11:mallinfo_uordblks=2077088
11:mallinfo_uordblks=15463760
11:mallinfo_uordblks=2030032
11:mallinfo_uordblks=2079856
11:mallinfo_uordblks=2079920
11:mallinfo_uordblks=2167808
11:mallinfo_uordblks=2396160
11:mallinfo_uordblks=34000240
11:mallinfo_uordblks=2649920
11:mallinfo_uordblks=1683776
11:mallinfo_uordblks=6316944
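Summing all the dumps directly (the uordblks values are in bytes) gives the total memory the allocator reports for these processes; for example, run from the statedump directory (usually /var/run/gluster):

# awk -F= '/mallinfo_uordblks/ {sum += $2} END {printf "%.1f MB in total\n", sum / 1024 / 1024}' *.dump.*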
All volumes use the default performance settings. For some reason, performance.cache-size is listed twice, once as 32MB and once as 128MB:
# gluster volume get <volumeId> all | grep performance | sort
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.cache-invalidation false
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-priority
performance.cache-refresh-timeout 1
performance.cache-samba-metadata false
performance.cache-size 128MB
performance.cache-size 32MB
performance.cache-swift-metadata true
performance.client-io-threads off
performance.enable-least-priority on
performance.flush-behind on
performance.force-readdirp true
performance.high-prio-threads 16
performance.io-cache on
performance.io-thread-count 16
performance.lazy-open yes
performance.least-prio-threads 1
performance.low-prio-threads 16
performance.md-cache-timeout 1
performance.nfs.flush-behind on
performance.nfs.io-cache off
performance.nfs.io-threads off
performance.nfs.quick-read off
performance.nfs.read-ahead off
performance.nfs.stat-prefetch off
performance.nfs.strict-o-direct off
performance.nfs.strict-write-ordering off
performance.nfs.write-behind on
performance.nfs.write-behind-window-size 1MB
performance.normal-prio-threads 16
performance.open-behind on
performance.parallel-readdir off
performance.quick-read on
performance.rda-cache-limit 10MB
performance.rda-high-wmark 128KB
performance.rda-low-wmark 4096
performance.rda-request-size 131072
performance.read-after-open no
performance.read-ahead on
performance.read-ahead-page-count 4
performance.readdir-ahead on
performance.resync-failed-syncs-after-fsync off
performance.stat-prefetch on
performance.strict-o-direct off
performance.strict-write-ordering off
performance.write-behind on
performance.write-behind-window-size 1MB
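I have not changed any of these so far; for reference, a single option could be checked or tuned per volume along these lines (<volumeId> again being a placeholder):

# gluster volume get <volumeId> performance.cache-size
# gluster volume set <volumeId> performance.cache-size 32MB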
Still, even when adding up all caches and buffer values, I can only account for about 2.5 GB of memory per node.
Restarting the daemons does not reduce the memory usage, and I have not found any further information on how to reduce it. Roughly 750 MB of memory per volume simply seems excessive and will lead to serious problems very soon.
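For reference, the per-process memory breakdown on a node (RSS in KB, largest first) can be seen with something like:

# ps -eo rss,cmd --sort=-rss | grep '[g]luster'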
Any hints?