I am running a 2-node-HA failover ActiveMQ Artemis 2.11.0 cluster using replication in our test environment. Both nodes (and all connecting clients (application servers)) use static configuration of nodes, i.e. no discovery is going on.
In one of my first tests, I set the application servers under medium load and wanted to see what would happen in a JMS broker failover scenario. First I stopped and, after a couple of minutes, restarted the slave node. Everything worked as expected.
Secondly I stopped the master node. Soon after, the slave announced to be live and the application servers failed over to the slave - on high level, i.e. the actual clients of the application server, this operation was transparent.
Then we restarted the master node, it synchronized with the slave, took over the responsibilities - and then the slave died.
The broker of each node is basically configured according to the official JMS Replicated Failback Static Example.
An excerpt of the logs of the starting master node:
2020-01-21 14:05:13,329 INFO [org.apache.activemq.artemis.integration.bootstrap] (main) AMQ101000: Starting ActiveMQ Artemis Server
2020-01-21 14:05:13,353 INFO [org.apache.activemq.artemis.core.server] (main) AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=true,journalDirectory=data/journal,bindingsDirectory=data/bindings,largeMessagesDirectory=data/large-messages,pagingDirectory=data/paging)
2020-01-21 14:05:13,648 INFO [org.apache.activemq.artemis.core.server] (AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=null) AMQ222162: Moving data directory /var/lib/artemis-master/data/bindings to /var/lib/artemis-master/data/bindings/oldreplica.1
...
2020-01-21 14:05:13,747 INFO [org.apache.activemq.artemis.core.server] (AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=null) AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
...
2020-01-21 14:05:13,810 INFO [org.apache.activemq.artemis.core.server] (AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=null) AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.11.0 [null] started, waiting live to fail before it gets active
2020-01-21 14:05:13,877 INFO [org.apache.activemq.hawtio.branding.PluginContextListener] (main) Initialized activemq-branding plugin
2020-01-21 14:05:13,928 INFO [org.apache.activemq.hawtio.plugin.PluginContextListener] (main) Initialized artemis-plugin plugin
2020-01-21 14:05:14,577 INFO [org.apache.activemq.artemis] (main) AMQ241001: HTTP Server started at http://172.16.2.33:8161
2020-01-21 14:05:14,577 INFO [org.apache.activemq.artemis] (main) AMQ241002: Artemis Jolokia REST API available at http://172.16.2.33:8161/console/jolokia
2020-01-21 14:05:14,577 INFO [org.apache.activemq.artemis] (main) AMQ241004: Artemis Console available at http://172.16.2.33:8161/console
2020-01-21 14:05:21,108 INFO [org.apache.activemq.artemis.core.server] (Thread-4 (ActiveMQ-client-netty-threads)) AMQ221024: Backup server ActiveMQServerImpl::serverUUID=139d6d77-3c3c-11ea-b519-00505686d7ad is synchronized with live-server.
2020-01-21 14:05:21,115 INFO [org.apache.activemq.artemis.core.server] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@37858383)) AMQ221031: backup announced
2020-01-21 14:06:00,209 INFO [org.apache.activemq.artemis.core.server] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@37858383)) AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
2020-01-21 14:06:00,209 INFO [org.apache.activemq.artemis.core.server] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@37858383)) AMQ221067: Waiting 30 seconds for quorum vote results.
2020-01-21 14:06:00,209 INFO [org.apache.activemq.artemis.core.server] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@37858383)) AMQ221068: Received all quorum votes.
2020-01-21 14:06:00,210 INFO [org.apache.activemq.artemis.core.server] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@37858383)) AMQ221071: Failing over based on quorum vote results.
2020-01-21 14:06:00,229 WARN [org.apache.activemq.artemis.core.client] (Thread-0 (ActiveMQ-client-global-threads)) AMQ212037: Connection failure to artemis-slave.local/172.16.2.34:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
2020-01-21 14:06:00,229 WARN [org.apache.activemq.artemis.core.client] (Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@37858383)) AMQ212037: Connection failure to artemis-slave.local/172.16.2.34:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
2020-01-21 14:06:00,249 INFO [org.apache.activemq.artemis.core.server] (AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=null) AMQ221037: ActiveMQServerImpl::serverUUID=139d6d77-3c3c-11ea-b519-00505686d7ad to become 'live'
2020-01-21 14:06:00,271 WARN [org.apache.activemq.artemis.core.client] (Thread-0 (ActiveMQ-client-global-threads)) AMQ212004: Failed to connect to server.
2020-01-21 14:06:00,822 WARN [org.apache.activemq.artemis.journal] (AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=null) AMQ142015: Uncommitted transaction with id 2,147,639,299 found and discarded
...
2020-01-21 14:06:01,524 INFO [org.apache.activemq.artemis.core.server] (AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=null) AMQ221007: Server is now live
2020-01-21 14:06:01,532 INFO [org.apache.activemq.artemis.core.server] (AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=null) AMQ221020: Started EPOLL Acceptor at artemis-master.local:61616 for protocols [CORE,MQTT,AMQP,HORNETQ,STOMP,OPENWIRE]
And the same timeframe of the slave node:
2020-01-21 14:05:18,894 INFO [org.apache.activemq.artemis.core.server] (Thread-14) AMQ221025: Replication: sending AIOSequentialFile:/var/lib/artemis-slave/data/journal/activemq-data-367.amq (size=10,485,760) to replica.
...
2020-01-21 14:05:19,978 INFO [org.apache.activemq.artemis.core.server] (Thread-14) AMQ221025: Replication: sending AIOSequentialFile:/var/lib/artemis-slave/data/journal/activemq-data-374.amq (size=10,485,760) to replica.
2020-01-21 14:05:20,037 INFO [org.apache.activemq.artemis.core.server] (Thread-14) AMQ221025: Replication: sending AIOSequentialFile:/var/lib/artemis-slave/data/journal/activemq-data-375.amq (size=10,485,760) to replica.
2020-01-21 14:05:20,154 INFO [org.apache.activemq.artemis.core.server] (Thread-14) AMQ221025: Replication: sending NIOSequentialFile /var/lib/artemis-slave/data/bindings/activemq-bindings-5.bindings (size=1,048,576) to replica.
...
2020-01-21 14:05:20,163 INFO [org.apache.activemq.artemis.core.server] (Thread-14) AMQ221025: Replication: sending NIOSequentialFile /var/lib/artemis-slave/data/bindings/activemq-bindings-7.bindings (size=1,048,576) to replica.
2020-01-21 14:05:38,157 WARN [org.apache.activemq.artemis.core.client] (Thread-7 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@37858383)) AMQ212037: Connection failure to /172.16.2.112:51866 has been detected: AMQ229014: Did not receive data from /172.16.2.112:51866 within the 30,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2020-01-21 14:05:50,168 WARN [org.apache.activemq.artemis.core.server] (Thread-14) AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=AMQ229114: Replication synchronization process timed out after waiting 30,000 milliseconds: ActiveMQReplicationTimeooutException[errorType=REPLICATION_TIMEOUT_ERROR message=AMQ229114: Replication synchronization process timed out after waiting 30,000 milliseconds]
at org.apache.activemq.artemis.core.replication.ReplicationManager.sendSynchronizationDone(ReplicationManager.java:654) [artemis-server-2.11.0.jar:2.11.0]
at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:699) [artemis-server-2.11.0.jar:2.11.0]
at org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation$2.run(SharedNothingLiveActivation.java:178) [artemis-server-2.11.0.jar:2.11.0]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_222]
2020-01-21 14:05:50,171 WARN [org.apache.activemq.artemis.core.server] (Thread-14) AMQ222251: Unable to start replication: ActiveMQReplicationTimeooutException[errorType=REPLICATION_TIMEOUT_ERROR message=AMQ229114: Replication synchronization process timed out after waiting 30,000 milliseconds]
at org.apache.activemq.artemis.core.replication.ReplicationManager.sendSynchronizationDone(ReplicationManager.java:654) [artemis-server-2.11.0.jar:2.11.0]
at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:699) [artemis-server-2.11.0.jar:2.11.0]
at org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation$2.run(SharedNothingLiveActivation.java:178) [artemis-server-2.11.0.jar:2.11.0]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_222]
2020-01-21 14:06:00,175 WARN [org.apache.activemq.artemis.core.client] (Thread-14) AMQ212041: Timed out waiting for netty channel to close
2020-01-21 14:06:00,226 ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl] (Thread-35 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7@6a6afff2)) appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with id 2147634792: java.lang.IllegalStateException: Cannot find tx with id 2147634792
at org.apache.activemq.artemis.core.journal.impl.JournalImpl$10.run(JournalImpl.java:1372) [artemis-journal-2.11.0.jar:2.11.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.11.0.jar:2.11.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.11.0.jar:2.11.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66) [artemis-commons-2.11.0.jar:2.11.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_222]
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.11.0.jar:2.11.0]
My interpretation is:
- Slave is currently live
- Master starts and resyncs with slave. The network replication takes about 1.2 seconds to transfer ~100MB of data.
- (Message
AMQ212037: Connection failure to /172.16.2.112:51866 has been detected
is a connection from the application servers that timed out due to inactivity I guess. This happens quite often and I think is caused by dormant pooled connections of the application servers) - At
14:05:21,108
the master announces "Backup server...is synchronized with live-server", which I interpret as "Replication successful, we can switch back". This is immediately followed byAMQ221031: backup announced
which sounds great. - At
14:05:50,168
, exactly 30 seconds after sending the last file successfully, the slave announces a "Critical IO Error"AMQ229114: Replication synchronization process timed out after waiting 30,000 milliseconds
and shuts down immediately. (Side note: logging this onWARN
level is an understatement in my opinion...) - The master does not report any error on his side at this time, rather a couple seconds afterwards it seems to notice the slave being down and starts a master election that is finished more or less immediately making master the new live server. To me this also seems to be correct behavior.
- (I don't really understand the consequences of the "transaction" part in both logs.)
How can I diagnose (and - actually even better - mitigate) the problem? Is there crucial information in the logs missing?
Thanks
Update 2020-01-27
Our issue looks an awful lot like the one described here: https://issues.apache.org/jira/browse/ARTEMIS-1170
That aside from application servers' side we use JTA-enabled Artemis clients. Could this be a source of problems?