We have two servers, server1 and server2, both running WildFly 11 in domain mode; server1 acts as the domain controller. Below is how we have configured both servers.
Issue:
If two messages with the same group ID arrive simultaneously on server1 and server2, the servers do not know which consumer each message should be routed to. As a result, the messages end up being processed by different consumers, and sometimes the message that arrived first is processed later, which is not desirable. We would like to configure the system so that both nodes agree on which consumer should process the messages for a given group ID.
Solution we tried:
We configured server1 with a LOCAL grouping handler and server2 with a REMOTE one. Now, whenever a message arrives, the LOCAL grouping handler determines which node hosts the consumer for that group ID, and the message is routed accordingly.
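For reference, the grouping handler was added with a CLI command roughly like the following. The handler name my-grouping-handler and the address jms are illustrative, the exact attribute names for your WildFly version can be checked with :read-operation-description(name=add) on the grouping-handler resource, and since the type differs per node this assumes either separate profiles for the two live servers or a per-node expression:
/profile=abc/subsystem=messaging-activemq/server=default/grouping-handler=my-grouping-handler:add(grouping-handler-address=jms, type=LOCAL)
The same command with type=REMOTE is applied for server2's messaging server.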
This works as long as server1 is up. However, if server1 goes down, messages are not processed at all. To fix this, we added a backup messaging-activemq server for server1 on server2, and likewise a backup for server2 on server1:
/profile=abc/subsystem=messaging-activemq/server=backup:add
(Because server1 is the domain controller and the profile is shared, the backup server is added on both nodes.)
We also added the same discovery-group, http-connector, and broadcast-group to this backup server, and set up a cluster-connection so that the backup and live servers on server1 and server2 join the same cluster group.
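For completeness, the backup server's cluster resources were created with commands along these lines; the acceptor, connector, group, and channel names are illustrative and should mirror the resources already defined on the live server, and attribute names may vary slightly between WildFly versions (they can be read back from the live server with :read-resource):
/profile=abc/subsystem=messaging-activemq/server=backup/http-acceptor=http-acceptor:add(http-listener=default)
/profile=abc/subsystem=messaging-activemq/server=backup/http-connector=http-connector:add(socket-binding=http, endpoint=http-acceptor)
/profile=abc/subsystem=messaging-activemq/server=backup/broadcast-group=bg-group1:add(jgroups-channel=activemq-cluster, connectors=[http-connector])
/profile=abc/subsystem=messaging-activemq/server=backup/discovery-group=dg-group1:add(jgroups-channel=activemq-cluster)
/profile=abc/subsystem=messaging-activemq/server=backup/cluster-connection=my-cluster:add(cluster-connection-address=jms, connector-name=http-connector, discovery-group=dg-group1)
The ha-policy for the live and backup servers was then set as follows: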
/profile=abc/subsystem=messaging-activemq/server=default/ha-policy=replication-master:add(cluster-name=my-cluster,group-name=${livegroup},check-for-live-server=true)
/profile=abc/subsystem=messaging-activemq/server=backup/ha-policy=replication-slave:add(cluster-name=my-cluster,group-name=${backupgroup})
server1 is configured with the following properties:
livegroup=group1
backupgroup=group2
server2 is configured with the following properties:
livegroup=group2
backupgroup=group1
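The per-node values above are supplied as system properties; in domain mode one way to do this is at server-config level, for example (host and server-config names below are placeholders):
/host=host1/server-config=server-one/system-property=livegroup:add(value=group1)
/host=host1/server-config=server-one/system-property=backupgroup:add(value=group2)
/host=host2/server-config=server-two/system-property=livegroup:add(value=group2)
/host=host2/server-config=server-two/system-property=backupgroup:add(value=group1)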
However, this does not fix the failover case: when the live node hosting the LOCAL grouping handler goes down, messages are still not processed on the other node. We get the error below on server2 when server1 shuts down:
[org.apache.activemq.artemis.core.server] (default I/O-3) AMQ222092: Connection to the backup node failed, removing replication now: ActiveMQRemoteDisconnectException[errorType=REMOTE_DISCONNECT message=null]
at org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl.connectionDestroyed(RemotingServiceImpl.java:533)
at org.apache.activemq.artemis.core.remoting.impl.netty.NettyAcceptor$Listener.connectionDestroyed(NettyAcceptor.java:682)
at org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler.channelInactive(ActiveMQChannelHandler.java:79)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:360)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:325)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1329)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:908)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:744)
at org.xnio.nio.WorkerThread.safeRun(WorkerThread.java:612)
at org.xnio.nio.WorkerThread.run(WorkerThread.java:479)
Please suggest either a different approach to handle this issue altogether, or a way to configure the setup so that it still works when the server with the LOCAL grouping handler shuts down.