We have (though it won't be true for much longer, if the powers that be have their way) a reasonably large cluster of about 600 nodes, all of them under the same group name, while only a fraction of them (about a dozen) ever made it into the list of TCP/IP interfaces defined in hazelcast.xml.
Here's our configuration:
<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.1.xsd"
xmlns="http://www.hazelcast.com/schema/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<group>
<name>BlappityBlah</name>
<password>blahBlaha</password>
</group>
<management-center enabled="false"/>
<network>
<port auto-increment="true">6401</port>
<outbound-ports>
<!--
Allowed port range when connecting to other nodes.
0 or * means use system provided port.
-->
<ports>0</ports>
</outbound-ports>
<join>
<multicast enabled="false">
<multicast-group>224.2.2.3</multicast-group>
<multicast-port>54327</multicast-port>
</multicast>
<tcp-ip enabled="true">
<interface>10.50.3.101-102,10.50.3.104-105,10.50.3.108-112,10.60.2.20,10.60.3.103,10.60.4.106-107</interface>
</tcp-ip>
<aws enabled="false">
<access-key>my-access-key</access-key>
<secret-key>my-secret-key</secret-key>
<!--optional, default is us-east-1 -->
</aws>
</join>
</network>
</hazelcast>
The rest are bound only by the group name, which, per my understanding, defines the cluster. We don't use multicast in our configuration. The primary application of our cluster is distributed locking.

What we are noticing of late is arbitrary timeouts and dropped connections between nodes, repeated re-partitioning, and hanging locks; everything freezes up after a while. Earlier we ended up rebooting the nodes; now we use the Hazelcast TestApp console to clear out the map of locks. I can vouch for the fact that the code that locks and unlocks is reasonably watertight.

My observation: we didn't have these kinds of issues until we updated Hazelcast to 3.1.5 AND scaled from 30-odd nodes to 500+, most of which are JVMs, often up to a dozen on the same physical machine. This didn't happen overnight; it was gradual.
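For reference, the locking pattern we follow is essentially the sketch below (simplified; the lock name and timeout are illustrative, not our real values). In Hazelcast 3.x, `getLock` returns an `ILock`, which implements `java.util.concurrent.locks.Lock`, so a bounded `tryLock` plus a `finally` block is the standard way to avoid leaking a lock when the owner dies or gets partitioned away:

```java
import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;

public class LockSketch {
    public static void main(String[] args) throws InterruptedException {
        // Joins the cluster using the hazelcast.xml found on the classpath
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // "my-resource" is an illustrative lock name
        ILock lock = hz.getLock("my-resource");

        // Bounded tryLock so a dead or partitioned owner can't block us forever
        if (lock.tryLock(30, TimeUnit.SECONDS)) {
            try {
                // ... critical section ...
            } finally {
                lock.unlock(); // always released, even on exception
            }
        } else {
            // Timed out: back off and retry rather than hang
        }

        hz.shutdown();
    }
}
```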
a) Does the fact that most of our nodes don't appear in hazelcast.xml affect their stability as members of the cluster?
b) Has anybody seen issues at this scale? Is this a Hazelcast bug, or are we doing something terribly wrong while the rest of you are having a ball with Hazelcast?