Ubuntu 18.04, GlusterFS 7.0
I have created a volume for my file share and started it:
sudo gluster volume create NAME replica 3 transport tcp host0:/path0 host1:/path1 host2:/path2
sudo gluster volume start NAME
Then I added an fstab entry on my clients:
host0:NAME /home/mountpoint glusterfs defaults,_netdev 0 0
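For completeness: the entry above only names host0 as the volfile server. My understanding is that the GlusterFS mount helper also accepts fallback volfile servers via the `backup-volfile-servers` option, so a variant like this (untested on this cluster) would at least let the client mount even if host0 is unreachable:

```shell
# /etc/fstab — same entry, with fallback volfile servers (untested variant)
host0:NAME /home/mountpoint glusterfs defaults,_netdev,backup-volfile-servers=host1:host2 0 0
```

As far as I know this only affects fetching the volfile at mount time, not brick connectivity afterwards, so it would not by itself explain or fix the disconnects.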
And mounted it on the clients:
sudo mount /home/mountpoint
Then, randomly, after 1-7 days, the mount disconnects on my clients (it can disconnect 2 of the 3), mostly at night but sometimes during the day. If I cd into that directory, it says:
Transport endpoint is not connected
To bring the mount back online, I have to do:
sudo umount /home/mountpoint && sudo mount /home/mountpoint
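When this happens, the first thing I check from one of the servers is whether the brick processes are actually up (standard gluster CLI commands; I am sketching the check, not pasting captured output):

```shell
# Show whether every brick process is online and which port it listens on
sudo gluster volume status NAME

# After the bricks come back, list any files with pending self-heals
sudo gluster volume heal NAME info
```

If `volume status` shows a brick as offline while glusterd itself is running, the problem is the brick process (glusterfsd) dying or losing its port, which matches the "failed to get the port number for remote subvolume" error below.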
Most of the time that works. But sometimes it fails with no specific reason in the logfile other than "bricks are offline". glusterd is running on all 3 servers and did not crash. The client log shows:
[2019-12-14 03:49:54.210690] W [socket.c:774:__socket_rwv] 0-launcher-client-2: readv on <IP>:<PORT> failed (No data available)
[2019-12-14 03:49:54.210718] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-launcher-client-2: disconnected from launcher-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2019-12-14 03:49:54.210735] W [MSGID: 108001] [afr-common.c:5653:afr_notify] 0-launcher-replicate-0: Client-quorum is not met
[2019-12-14 03:49:57.271596] E [MSGID: 114058] [client-handshake.c:1456:client_query_portmap_cbk] 0-launcher-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2019-12-14 03:50:23.647924] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649274: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:50:23.648092] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649275: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:50:46.192371] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649321: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:50:46.192445] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649322: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:50:46.626681] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649323: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:50:46.626769] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649324: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:50:48.254712] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649328: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:50:48.254862] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649329: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:51:02.002344] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649357: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:51:02.002426] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649358: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:51:02.478503] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649362: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:51:02.478566] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649363: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:51:02.870624] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649364: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:51:02.870713] W [fuse-bridge.c:1276:fuse_attr_cbk] 0-glusterfs-fuse: 1649365: STAT() /www => -1 (Transport endpoint is not connected)
[2019-12-14 03:51:13.450634] W [fuse-bridge.c:2837:fuse_readv_cbk] 0-glusterfs-fuse: 1649389: READ => -1 gfid=270fafc1-615a-4686-a0f8-50e17965ba10 fd=0x7f64c002c468 (Transport endpoint is not connected)
[2019-12-14 03:51:13.450702] W [fuse-bridge.c:2837:fuse_readv_cbk] 0-glusterfs-fuse: 1649390: READ => -1 gfid=270fafc1-615a-4686-a0f8-50e17965ba10 fd=0x7f64c002c468 (Transport endpoint is not connected)
[2019-12-14 03:51:13.450717] W [fuse-bridge.c:2837:fuse_readv_cbk] 0-glusterfs-fuse: 1649391: READ => -1 gfid=270fafc1-615a-4686-a0f8-50e17965ba10 fd=0x7f64c002c468 (Transport endpoint is not connected)
[2019-12-14 03:51:13.450807] W [fuse-bridge.c:2837:fuse_readv_cbk] 0-glusterfs-fuse: 1649392: READ => -1 gfid=270fafc1-615a-4686-a0f8-50e17965ba10 fd=0x7f64c002c468 (Transport endpoint is not connected)
[2019-12-14 03:51:13.450906] W [fuse-bridge.c:2837:fuse_readv_cbk] 0-glusterfs-fuse: 1649393: READ => -1 gfid=270fafc1-615a-4686-a0f8-50e17965ba10 fd=0x7f64c002c468 (Transport endpoint is not connected)
And I have to restart the volume itself on the server:
sudo gluster volume stop NAME && sudo gluster volume start NAME
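As an aside, my understanding is that when only the brick processes are dead (glusterd still running), a forced start restarts just the offline bricks without taking the whole volume down, which would be less disruptive than the full stop/start above. I have not confirmed it is enough in this situation:

```shell
# Restart only the offline brick processes; the volume stays online
sudo gluster volume start NAME force
```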
This is not the first pool of servers to have this issue. I had the same problem on another cluster of servers, could not solve it, and had to move away from Gluster there.
From what I can tell:
- The servers did not lose network connectivity at the time of the GlusterFS disconnect
- The servers do not have HDD issues
- The servers do not run any I/O-intensive applications on GlusterFS; it is mostly a folder share for nginx
How can I solve this? Thanks.