Whilst trying to diagnose a different issue with my cluster, I tried isolating my environments to force election events. When starting nodes in isolation, though, my app failed to start with this exception:

Caused by: java.util.concurrent.TimeoutException: null
    at org.neo4j.cluster.statemachine.StateMachineProxyFactory$ResponseFuture.get(StateMachineProxyFactory.java:300) ~[neo4j-cluster-2.0.1.jar:2.0.1]
    at org.neo4j.cluster.client.ClusterJoin.joinByConfig(ClusterJoin.java:158) ~[neo4j-cluster-2.0.1.jar:2.0.1]
    at org.neo4j.cluster.client.ClusterJoin.start(ClusterJoin.java:91) ~[neo4j-cluster-2.0.1.jar:2.0.1]
    at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:503) ~[neo4j-kernel-2.0.1.jar:2.0.1]
    ... 59 common frames omitted

My configuration sets a 60-second join timeout (ha.cluster_join_timeout) and allows individual nodes to initialize the cluster (ha.allow_init_cluster).

Looking at a truncated chunk of code from the ClusterJoin class, I believe that in some failure cases the code either loops and retries the join, or the current node creates a new cluster.

private void joinByConfig() throws TimeoutException
{
    while( true )
        {
            if (config.getClusterJoinTimeout() > 0)
            {
                try
                {
                    console.log( "Joined cluster:" + clusterConfig.get(config.getClusterJoinTimeout(), TimeUnit.MILLISECONDS ));
                    return;
                }
                catch ( InterruptedException e )
                {
                    console.log( "Could not join cluster, interrupted. Retrying..." );
                }
                catch ( ExecutionException e )
                {
                    logger.debug( "Could not join cluster " + this.config.getClusterName() );
                    if ( e.getCause() instanceof IllegalStateException )
                    {
                        throw ((IllegalStateException) e.getCause());
                    }

                    if ( config.isAllowedToCreateCluster() )
                    {
                        // Failed to join cluster, create new one
                        console.log( "Could not join cluster of " + hosts.toString() );
                        console.log( format( "Creating new cluster with name [%s]...", config.getClusterName() ) );
                        cluster.create( config.getClusterName() );
                        break;
                    }

                    console.log( "Could not join cluster, timed out. Retrying..." );
                }
            }
            // ... (remainder of the method truncated)

However, a TimeoutException is not one of these cases; in fact, the joinByConfig method itself declares that it throws TimeoutException. The StateMachineProxyFactory$ResponseFuture class (which implements Future) throws a TimeoutException when the timeout has elapsed and no state machine message has been received.
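To make the propagation path concrete, here is a minimal sketch that mirrors only the shape of the try/catch above. It is not Neo4j code: the class `TimeoutPropagationDemo`, the methods `join` and `describe`, and the use of `CompletableFuture` as a stand-in for the cluster configuration future are all assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutPropagationDemo
{
    // Hypothetical method with the same catch clauses as joinByConfig:
    // only InterruptedException and ExecutionException are handled.
    static String join( Future<String> clusterConfig, long timeoutMillis ) throws TimeoutException
    {
        try
        {
            return "joined: " + clusterConfig.get( timeoutMillis, TimeUnit.MILLISECONDS );
        }
        catch ( InterruptedException e )
        {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        catch ( ExecutionException e )
        {
            return "failed: " + e.getCause();
        }
        // There is no catch for TimeoutException, so it escapes to the caller.
    }

    // Reports which path a given future takes.
    static String describe( Future<String> clusterConfig, long timeoutMillis )
    {
        try
        {
            return join( clusterConfig, timeoutMillis );
        }
        catch ( TimeoutException e )
        {
            return "timeout propagated";
        }
    }

    public static void main( String[] args )
    {
        // A future that never completes stands in for an isolated node.
        System.out.println( describe( new CompletableFuture<String>(), 50 ) );

        // A future that completes exceptionally takes the ExecutionException path.
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally( new IllegalArgumentException( "refused" ) );
        System.out.println( describe( failed, 50 ) );
    }
}
```

With a future that never completes, the timed `get` throws TimeoutException, none of the catch clauses match, and the exception leaves the method. That matches the stack trace I am seeing.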

public synchronized Object get( long timeout, TimeUnit unit )
            throws InterruptedException, ExecutionException, TimeoutException
    {
        if ( response != null )
        {
            getResult();
        }

        this.wait( unit.toMillis( timeout ) );

        if ( response == null )
        {
            throw new TimeoutException();
        }
        return getResult();
    }

When joining a cluster has timed out and the node is configured to initialise a cluster, shouldn't the TimeoutException be handled and a new cluster initialised, rather than the exception being propagated? If that is not the case, do clustered servers always have to be started up in unison?
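For clarity, this is roughly the behaviour I was expecting, written as a standalone sketch. It is not the Neo4j implementation and not a proposed patch: the class `JoinOrCreateSketch`, the `Cluster` interface, and the `joinOrCreate` method and its parameters are hypothetical names that only illustrate treating a timeout like the other "could not join" cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class JoinOrCreateSketch
{
    /** Hypothetical stand-in for the cluster API; not the Neo4j interface. */
    interface Cluster
    {
        void create( String clusterName );
    }

    // Waits for the join to complete. On timeout, creates a new cluster if
    // that is allowed; otherwise rethrows the TimeoutException.
    static String joinOrCreate( Future<String> clusterConfig, long timeoutMillis,
                                boolean allowedToCreateCluster, String clusterName, Cluster cluster )
            throws TimeoutException, InterruptedException, ExecutionException
    {
        try
        {
            return "joined " + clusterConfig.get( timeoutMillis, TimeUnit.MILLISECONDS );
        }
        catch ( TimeoutException e )
        {
            if ( !allowedToCreateCluster )
            {
                throw e;
            }
            // Nobody answered within the timeout, so start a new cluster.
            cluster.create( clusterName );
            return "created " + clusterName;
        }
    }

    public static void main( String[] args ) throws Exception
    {
        List<String> created = new ArrayList<>();
        // The never-completing future stands in for a node started in isolation.
        String outcome = joinOrCreate( new CompletableFuture<String>(), 50, true, "my.cluster", created::add );
        System.out.println( outcome + " " + created );
    }
}
```

In this sketch a node that is allowed to create a cluster never fails on startup just because no other member responded, while a node that is not allowed still sees the TimeoutException.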

JohnMark13