We've been using Hazelcast for a number of years but I'm new to the group. We have a cluster formed by a dedicated Java application (its sole purpose is to provide the cluster). It's using the 3.8.2 jars and running JDK 1.8.0_192 on Linux (CentOS 7).
The cluster manages relatively static data (i.e. a few updates a day or week), although a single update may involve changing a 2MB chunk of data. We're using the default sharding config with 271 shards across 6 cluster members. There are between 40 and 80 clients. Each client connection should be long-lived and stable.
"Occasionally" we get into a situation where the Java app that's providing the cluster repeatedly restarts and any client that attempts to write to the cluster is unable to do so. We've had issues in the past where the cluster app runs out of memory due to limits on the JVM command line. We've previously increased these and (to the best of my knowledge) the process restarts are no longer caused by OutOfMemory exceptions.
I'm aware we're running a very old version and many people will suggest simply updating. This is work we will carry out but we're attempting to diagnose the existing issue with the system we have in front of us.
What I'm looking for here are suggestions on the types of investigation to carry out and queries to run, either periodically while the system is healthy or while it is in this failed state.
We regularly use tools such as netstat, tcpdump, Wireshark and top (I'm sure there are more) when diagnosing issues like this, but have been unable to establish a convincing root cause.
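Alongside those tools, one thing we could add to the cluster app itself is periodic logging of heap usage and GC activity, so that a restart can be lined up against memory pressure in the logs. The following is only a rough sketch using the standard JDK management beans (nothing Hazelcast-specific). The class name MemberHealthLogger, the log format and the one-second interval are placeholders I've made up for illustration, not something we currently run.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: logs heap usage and cumulative GC counters for this JVM.
public class MemberHealthLogger {

    private static final long MB = 1024L * 1024L;

    // Builds a single log line describing current heap usage and GC totals.
    static String snapshot() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        StringBuilder sb = new StringBuilder();
        sb.append("heapUsedMB=").append(heap.getUsed() / MB);
        sb.append(" heapCommittedMB=").append(heap.getCommitted() / MB);
        // getMax() returns -1 when the maximum is undefined.
        long max = heap.getMax();
        sb.append(" heapMaxMB=").append(max < 0 ? -1 : max / MB);

        // One entry per collector: cumulative collection count / total time in ms.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(" gc[").append(gc.getName()).append("]=")
              .append(gc.getCollectionCount()).append("/")
              .append(gc.getCollectionTime()).append("ms");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Number of samples to print; assumed default of 3 for a quick manual run.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        for (int i = 0; i < samples; i++) {
            System.out.println(System.currentTimeMillis() + " " + snapshot());
            Thread.sleep(1000);
        }
    }
}
```

In the real app this would presumably run on a background thread and write to the normal log file rather than stdout. A steadily climbing heapUsedMB or a jump in GC time shortly before a restart would point back towards memory, while flat numbers would suggest looking elsewhere.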
Any help greatly appreciated.
Thanks, Dave
As per the problem description, our only way to resolve the issue is to bounce the cluster completely, i.e. stop all the members and then restart the cluster. Ideally we'd have a system that remained stable and could recover from whatever "event" causes the issue we're seeing. This may involve config or code changes.