We are currently running a three-node cluster on Gluster 3.6.4.
On one of our nodes we noticed that the glusterd daemon is dead, but the glusterfsd brick daemons are still running, and clients appear to be connecting and retrieving data normally.
In fact the daemon has been down for about a week without us noticing; the NFS mounts of our distributed volumes continued to work throughout.
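For reference, this is roughly how we checked the state of things (a rough sketch; DAOS is the volume name that appears in the logs below, and the service/systemctl choice depends on the distribution):

    # On the affected node: glusterd is dead...
    service glusterd status      # or: systemctl status glusterd
    # ...but the brick processes are still up
    pgrep -fl glusterfsd

    # From one of the healthy peers (the gluster CLI needs a
    # running glusterd to talk to, so don't run this on the
    # affected node):
    gluster peer status
    gluster volume status DAOS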
We would like to know: is it safe to simply start the glusterd service again?
If so, would this trigger a self-heal on all volumes? That is a concern for us, as a full self-heal would cause a performance hit. If restarting is indeed safe, our tentative plan would be something like the sketch below.
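(Again assuming the volume name DAOS from the logs, and service vs. systemctl depending on the distribution:)

    # From a healthy peer, before changing anything: how much
    # is already pending heal?
    gluster volume heal DAOS info

    # On the affected node, bring glusterd back up:
    service glusterd start       # or: systemctl start glusterd

    # Verify the node rejoined the cluster and its bricks are
    # visible again:
    gluster peer status
    gluster volume status DAOS

    # Watch whether a self-heal actually kicks off:
    gluster volume heal DAOS info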
The logs for this node are as follows:
[2016-08-19 18:01:52.804453] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f4f3ffca550] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f4f3fd9f787] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4f3fd9f89e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f4f3fd9f951] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7f4f3fd9ff1f] ))))) 0-DAOS-client-4: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2016-08-19 18:01:51.886737 (xid=0x144a1d)
[2016-08-19 18:01:52.804480] W [client-handshake.c:1588:client_dump_version_cbk] 0-DAOS-client-4: received RPC status error
[2016-08-19 18:01:52.804504] W [socket.c:620:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (No data available)
[2016-08-19 18:02:02.900863] E [socket.c:2276:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
If it is not safe to do so, what else should we do to resolve this?
(Useful background: this blog entry discusses the difference between glusterfsd and glusterd: http://blog.nixpanic.net/2013/12/gluster-and-not-restarting-brick.html)