Whilst trying to diagnose a different issue with my cluster, I tried isolating my environments to force election events. When starting nodes in isolation, though, my app failed to start with this exception:
Caused by: java.util.concurrent.TimeoutException: null
    at org.neo4j.cluster.statemachine.StateMachineProxyFactory$ResponseFuture.get(StateMachineProxyFactory.java:300) ~[neo4j-cluster-2.0.1.jar:2.0.1]
    at org.neo4j.cluster.client.ClusterJoin.joinByConfig(ClusterJoin.java:158) ~[neo4j-cluster-2.0.1.jar:2.0.1]
    at org.neo4j.cluster.client.ClusterJoin.start(ClusterJoin.java:91) ~[neo4j-cluster-2.0.1.jar:2.0.1]
    at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:503) ~[neo4j-kernel-2.0.1.jar:2.0.1]
    ... 59 common frames omitted
My configuration sets a 60-second join timeout (ha.cluster_join_timeout) and allows individual nodes to initialize the cluster (ha.allow_init_cluster).
Looking at a truncated chunk of code from the ClusterJoin class, I believe that after certain failures the code will either loop and attempt to connect again, or the current node will create a new cluster.
private void joinByConfig() throws TimeoutException
{
    while( true )
    {
        if (config.getClusterJoinTimeout() > 0)
        {
            try
            {
                console.log( "Joined cluster:" + clusterConfig.get(config.getClusterJoinTimeout(), TimeUnit.MILLISECONDS ));
                return;
            }
            catch ( InterruptedException e )
            {
                console.log( "Could not join cluster, interrupted. Retrying..." );
            }
            catch ( ExecutionException e )
            {
                logger.debug( "Could not join cluster " + this.config.getClusterName() );
                if ( e.getCause() instanceof IllegalStateException )
                {
                    throw ((IllegalStateException) e.getCause());
                }
                if ( config.isAllowedToCreateCluster() )
                {
                    // Failed to join cluster, create new one
                    console.log( "Could not join cluster of " + hosts.toString() );
                    console.log( format( "Creating new cluster with name [%s]...", config.getClusterName() ) );
                    cluster.create( config.getClusterName() );
                    break;
                }
                console.log( "Could not join cluster, timed out. Retrying..." );
            }
        }
However, a TimeoutException is not one of these cases, and in fact the joinByConfig method itself is declared to throw TimeoutException. The StateMachineProxyFactory$ResponseFuture class (which implements Future) throws a TimeoutException when the timeout has elapsed and no state machine message has been received.
public synchronized Object get( long timeout, TimeUnit unit )
        throws InterruptedException, ExecutionException, TimeoutException
{
    if ( response != null )
    {
        getResult();
    }
    this.wait( unit.toMillis( timeout ) );
    if ( response == null )
    {
        throw new TimeoutException();
    }
    return getResult();
}
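The propagation described above can be reproduced outside Neo4j. The following is a minimal sketch, assuming a plain CompletableFuture that is never completed as a stand-in for ResponseFuture. The class name TimeoutEscapeDemo and the method joinOnce are hypothetical and not part of Neo4j.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutEscapeDemo {
    // Hypothetical helper that mirrors the shape of the try/catch in
    // joinByConfig: only InterruptedException and ExecutionException
    // have catch clauses.
    static String joinOnce(Future<String> clusterConfig, long timeoutMillis)
            throws TimeoutException {
        try {
            return "joined: " + clusterConfig.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause();
        }
        // TimeoutException has no catch clause, so it leaves the method.
    }

    public static void main(String[] args) {
        // A future that is never completed, standing in for a join
        // request that no other cluster member answers.
        Future<String> neverAnswered = new CompletableFuture<>();
        try {
            System.out.println(joinOnce(neverAnswered, 100));
        } catch (TimeoutException e) {
            System.out.println("TimeoutException escaped joinOnce");
        }
    }
}
```

With no other member answering, the timed get gives up and the exception bypasses both catch clauses, which matches the stack trace I am seeing.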
Should it be the case that when joining a cluster has timed out, and the node is configured to initialise a cluster, the TimeoutException is not propagated and a new cluster is initialised instead? If not, do clustered servers always have to be started in unison?
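To make the question concrete, the behaviour I am asking about could look roughly like the sketch below. It illustrates the proposal only and is not Neo4j's actual code: JoinFallbackSketch, joinOrCreate, Outcome and the allowCreate flag (standing in for config.isAllowedToCreateCluster()) are hypothetical names, and a plain Future stands in for clusterConfig.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class JoinFallbackSketch {
    enum Outcome { JOINED, CREATED }

    // Hypothetical: treat a join timeout like the "could not join" case
    // and fall back to creating a cluster when that is allowed.
    static Outcome joinOrCreate(Future<?> clusterConfig, long timeoutMillis, boolean allowCreate)
            throws TimeoutException, InterruptedException, ExecutionException {
        try {
            clusterConfig.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.JOINED;
        } catch (TimeoutException e) {
            if (allowCreate) {
                // This is where cluster.create( config.getClusterName() )
                // would be called in the real class.
                return Outcome.CREATED;
            }
            // Not allowed to create a cluster: propagate as happens today.
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Nobody answers the join request, but creation is allowed.
        Future<String> neverAnswered = new CompletableFuture<>();
        System.out.println(joinOrCreate(neverAnswered, 100, true)); // prints CREATED
    }
}
```

With allowCreate set to false the TimeoutException would still propagate, so nodes that are not permitted to initialise a cluster would keep the current behaviour.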