We have a reasonably large cluster of about 600 nodes (which won't be true for much longer, if the powers that be have their way), all of them under the same "Group Name", while only a fraction of them (about a dozen) ever made it into the list of TCP/IP interfaces defined in hazelcast.xml.

Here's our configuration:

<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.1.xsd"
           xmlns="http://www.hazelcast.com/schema/config"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <group>
        <name>BlappityBlah</name>
        <password>blahBlaha</password>
    </group>
    <management-center enabled="false"/>
    <network>
        <port auto-increment="true">6401</port>
        <outbound-ports>
            <!--
            Allowed port range when connecting to other nodes.
            0 or * means use system provided port.
            -->
            <ports>0</ports>
        </outbound-ports>
        <join>
            <multicast enabled="false">
                <multicast-group>224.2.2.3</multicast-group>
                <multicast-port>54327</multicast-port>
            </multicast>
            <tcp-ip enabled="true">
                <interface>10.50.3.101-102,10.50.3.104-105,10.50.3.108-112,10.60.2.20,10.60.3.103,10.60.4.106-107</interface>
            </tcp-ip>
            <aws enabled="false">
                <access-key>my-access-key</access-key>
                <secret-key>my-secret-key</secret-key>
                <!--optional, default is us-east-1 -->
                <region>us-west-1</region>
            </aws>
        </join>
    </network>
</hazelcast>

The rest are bound only by the "Group Name", which, per my understanding, is what defines the cluster. We don't use multicast in our configuration. The primary application of our cluster is distributed locking.

What we are noticing of late is arbitrary timeouts and dropped connections between nodes, repeated "re-partitioning", and hanging locks. Everything freezes up after a while. Earlier we ended up rebooting the nodes; now we use the Hazelcast TestApp console to clear the map of locks. I can vouch that the code that locks and unlocks is reasonably watertight.

My observation: we didn't have these kinds of issues until we updated Hazelcast to 3.1.5 AND scaled from 30-odd nodes to 500+, most of which are JVMs, often up to a dozen on the same physical node. This didn't happen overnight; it was gradual.
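
To illustrate, the lock/unlock discipline on our nodes is essentially the following (a minimal sketch against the standard 3.x ILock API; the lock name, timeout, and class name are placeholders, not our actual code):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;

import java.util.concurrent.TimeUnit;

public class LockingSketch {
    public static void main(String[] args) throws InterruptedException {
        // Each member picks up the hazelcast.xml shown above from its classpath.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // "workflow-step" is a placeholder lock name.
        ILock lock = hz.getLock("workflow-step");

        // Bounded wait rather than lock(): a stuck owner then surfaces as a
        // timeout in our logs instead of a permanently frozen thread.
        if (lock.tryLock(30, TimeUnit.SECONDS)) {
            try {
                // ... critical section: the work guarded by the lock ...
            } finally {
                lock.unlock(); // always released, even if the work throws
            }
        } else {
            // could not acquire within 30s: log and retry / escalate
        }
    }
}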

a) Does the fact that most of our nodes don't figure in the hazelcast.xml impact their stability as members of the cluster?

b) Has anybody seen issues with scaling, is this a Hazelcast bug, or are we doing something terribly wrong while the rest of you are having a ball with Hazelcast?

SriniMurthy
  • What is the reason you need a 600-node cluster? A simple 10-node cluster with 15 GB heaps should be able to hold a very serious quantity of locks. – pveentjer Jan 13 '16 at 06:22
  • And what is the reason you have up to a dozen HZ JVMs on a single physical node? – pveentjer Jan 13 '16 at 06:50
  • I never got a chance to see this question, but this is the nature of our cluster: there are intense math computations churning through each of these nodes, a good dozen of them colocated on the same machine, each of which may have up to 300 GB of RAM and is partitioned across a dozen JVMs to handle different aspects of our workflow. Could we do it with Hadoop now? Maybe, but the dinosaurs continue to live – SriniMurthy Sep 21 '16 at 21:30

1 Answer

a) Does the fact that most of our nodes don't figure in the hazelcast.xml impact their stability as members of the cluster?

No. A member doesn't have to appear in the tcp-ip list itself; as long as it can reach at least one of the listed addresses to join, it becomes a full member and learns the rest of the member list from the cluster.

b) Has anybody seen issues with scaling, is this a Hazelcast bug, or are we doing something terribly wrong while the rest of you are having a ball with Hazelcast?

The chance of cluster repartitioning increases as you add nodes. I.e. if the chance of a single node failing is e.g. 0.01% per day, then with 600 nodes your chance of seeing a daily node failure (= repartitioning) is almost 6%. With a 0.001% failure chance per node per day, you'd still be at roughly 0.6% per day cluster-wide.
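
To spell out the arithmetic (assuming node failures are independent, so the complement rule applies):

P(at least one of 600 nodes fails on a given day) = 1 − (1 − p)^600
p = 0.0001  (0.01% per node per day):  1 − 0.9999^600  ≈ 5.8%
p = 0.00001 (0.001% per node per day): 1 − 0.99999^600 ≈ 0.6%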

In other words, your cluster is probably larger than what's advisable, regardless of implementation.

nilskp
  • One thing to also note: if you run such a big cluster you'll have to change the partition count, which by default is 271 (property "hazelcast.partition.count"). Please see the docs about how to do that: http://docs.hazelcast.org/docs/3.5/manual/html-single/#system-properties (a minimal sketch of setting this property follows the comments below). – noctarius Jan 14 '16 at 09:05
  • Thank you, noctarius and nilskp. I'll update my findings as and when these configuration changes make any impact on my cluster – SriniMurthy Jan 14 '16 at 18:06
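
Following up on noctarius's comment: here is a minimal sketch of raising the partition count programmatically (the value 2039 and the class name are illustrative only; the same property can also go into hazelcast.xml under a <properties> element, and every member must be started with the identical value):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class PartitionCountSketch {
    public static void main(String[] args) {
        Config config = new Config();

        // Default partition count is 271; a cluster of several hundred members
        // wants a larger value so each member still owns a reasonable number of
        // partitions. 2039 is only an illustrative value, not a recommendation.
        config.setProperty("hazelcast.partition.count", "2039");

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}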