
We use a STOMP broker relay (external broker: ActiveMQ 5.13.2) in our project; see https://docs.spring.io/spring/docs/current/spring-framework-reference/web.html#websocket-stomp-handle-broker-relay

We use the following stack:

org.springframework:spring-jms:jar:5.1.8.RELEASE
org.springframework:spring-messaging:jar:5.1.8.RELEASE
io.projectreactor:reactor-core:jar:3.2.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.8.6.RELEASE
io.netty:netty-all:jar:4.1.34.Final

From time to time (roughly once every 2 weeks) we see the following errors in Tomcat's catalina.out log:

2019-08-21 13:38:58,891 [tcp-client-scheduler-5] ERROR com.*.websocket.stomp.SimpMessagingSender  - BrokerAvailabilityEvent[available=false, StompBrokerRelay[ReactorNettyTcpClient[reactor.netty.tcp.TcpClientDoOn@219abb46]]]
2019-08-21 13:38:58,965 [tcp-client-scheduler-1] ERROR org.springframework.messaging.simp.stomp.StompBrokerRelayMessageHandler  - Transport failure: java.lang.IllegalStateException: No TcpConnection available

After that error, STOMP communication is broken (the system connection, a single TCP connection, is not available).

It seems that everything started when we updated the stack from:

org.springframework:spring-jms:jar:5.0.8.RELEASE
org.springframework:spring-messaging:jar:5.0.8.RELEASE
io.projectreactor:reactor-core:jar:3.1.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.7.8.RELEASE
io.netty:netty-all:jar:4.1.25.Final

The ActiveMQ version has not changed.

There is a bug reported against Spring saying that auto-reconnect fails when the system connection is lost; see: https://github.com/spring-projects/spring-framework/issues/22080

And now 3 questions:

  1. How can we make this problem more reproducible?
  2. How can we fix this reconnect behavior? :)
  3. How can we prevent losing this connection? :)

EDIT 23.09.2019

After the error occurred, the TCP state for port 61613 (STOMP) is the following (please note the CLOSE_WAIT state):

netstat -an | grep 61613
tcp6       0      0 :::61613                :::*                    LISTEN
tcp6       2      0 127.0.0.1:49084         127.0.0.1:61613         CLOSE_WAIT
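
To watch for this state automatically rather than running netstat by hand, a check along these lines could be scheduled. This is only a sketch: the class name `CloseWaitCheck` and the method `closeWaitLines` are made up for illustration, and it assumes the six-column `netstat -an` layout shown above (protocol, Recv-Q, Send-Q, local address, foreign address, state).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CloseWaitCheck {

    /**
     * Returns the netstat lines that are in CLOSE_WAIT and have the given port
     * as either the local or the foreign endpoint.
     * Assumes the column layout of `netstat -an` shown in the question.
     */
    static List<String> closeWaitLines(List<String> netstatLines, int port) {
        String portSuffix = ":" + port;
        return netstatLines.stream()
                .filter(line -> line.trim().endsWith("CLOSE_WAIT"))
                .filter(line -> {
                    // Columns: proto, recv-q, send-q, local address, foreign address, state
                    String[] cols = line.trim().split("\\s+");
                    return cols.length >= 6
                            && (cols[3].endsWith(portSuffix) || cols[4].endsWith(portSuffix));
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The two lines from the netstat output above, used as sample input.
        List<String> sample = Arrays.asList(
                "tcp6       0      0 :::61613                :::*                    LISTEN",
                "tcp6       2      0 127.0.0.1:49084         127.0.0.1:61613         CLOSE_WAIT");
        closeWaitLines(sample, 61613).forEach(System.out::println);
    }
}
```

CLOSE_WAIT generally means the peer has closed its side and the local process has not yet closed the socket. Since the local address here is the ephemeral port 49084, this suggests (but does not prove) that the broker closed the connection and the client side is still holding it.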
snieguu
  • While ActiveMQ is involved here it doesn't appear to be the source of the issue (especially given the fact that the version hasn't changed) so I'm removing the `activemq` tag. – Justin Bertram Sep 23 '19 at 21:21
  • Just an idea, but does this occur after the main router goes offline and comes back up? I have some software that seems to need its services restarted when the network goes down and comes back up. – Tschallacka Jan 24 '20 at 16:20

1 Answer


I can't say that I have enough information to answer your question although I have some input that may help you find a way forward.

ActiveMQ is typically used in an environment that is hosted/distributed, so load and scaling should always be a consideration.

Most databases, message queues, etc. need some sort of tuning for load, even on AWS (by requesting higher limits), even though most of that is taken care of by the hosting provider.

But I digress...

In this case it appears you're using the TCP transport for your queue:

https://activemq.apache.org/tcp-transport-reference

As you can see, all of these settings can be tuned and have default values.

So when issues are logged on the Spring side while connecting to AMQ, you'll want to narrow down the time of the error and then look at your AMQ metrics and logs for the same period.
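
As a rough sketch of that correlation step, the timestamp prefix of the client-side log lines can be parsed and compared against a time window. The class `LogTimestamp` and its methods are hypothetical helpers, and the pattern assumes the `yyyy-MM-dd HH:mm:ss,SSS` prefix seen in the log lines you posted; your AMQ logs may use a different layout.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogTimestamp {

    // Matches a leading timestamp such as "2019-08-21 13:38:58,891".
    private static final Pattern PREFIX =
            Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})");

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Extracts the leading timestamp of a log line, if the line starts with one. */
    static Optional<LocalDateTime> parse(String logLine) {
        Matcher m = PREFIX.matcher(logLine);
        return m.find()
                ? Optional.of(LocalDateTime.parse(m.group(1), FORMAT))
                : Optional.empty();
    }

    /** True if the two instants are no further apart than the given window. */
    static boolean within(LocalDateTime a, LocalDateTime b, Duration window) {
        return Duration.between(a, b).abs().compareTo(window) <= 0;
    }
}
```

With something like this you could filter the broker log down to, say, the minute around each `Transport failure` entry. Keep in mind that the two hosts' clocks and time zones need to agree for the comparison to mean anything.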

If you don't have monitoring for AMQ, I suggest:

  1. Add Monitoring - https://activemq.apache.org/how-can-i-monitor-activemq
  2. Add logging (or find out where the logs are), then enable detailed logging. AMQ uses log4j, so look at the log4j config file or add one. Beyond this, consider sending the logs to a log aggregator: https://activemq.apache.org/how-can-i-enable-detailed-logging
  3. Look at your hosting provider's metrics and downtime. For instance, if using AWS, there are very detailed incident logs for network failures or momentary issues with VPC or cross-region tunneling, network traffic in/out, etc.

Setting up the right tools for your distributed systems so your team can search for and find errors and logs (and documenting how to do it) is extremely helpful. A step beyond this (for mature systems) is to add an alerting layer on top of your monitoring, so that your systems tell you when there is a problem instead of you having to go looking for it.

That may be a bit verbose - but that all leads up to me asking if you have logs / metrics for the AMQ system at the times of the failure. If you do, please post them!

I make these suggestions because:

  • There is no information provided on your load expectation, variability of load, or recognition that load is a consideration in a system (via troubleshooting steps).
  • Logs/errors provided are strictly from the client side.
  • The error is infrequent and hard to reproduce, so it could be almost anything (memory leak, load issue, etc.), which is why monitoring is necessary.

Also consider adding Spring Actuator to monitor your message client on the Spring side. Client libraries frequently have limits and advanced settings for things like connection pools, and if you scale your instance size up or down so that it handles more or less load, those client settings may need tuning.

https://www.baeldung.com/spring-boot-actuators

Exposing metrics about current Websocket connections with Spring

You can also catch the exception and tear down and re-create your connection/settings, although this wouldn't be the first thing I'd recommend without knowing more about the situation and the stats at the time of the connection failure.
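
If you do go down that road, the retry loop itself could look roughly like the sketch below. It is plain Java and not Spring's relay API: `Reconnect.withBackoff` is a hypothetical helper, and what the `connect` callable actually does (how the connection is torn down and re-created) is left open here, since that depends on your setup.

```java
import java.util.concurrent.Callable;

public class Reconnect {

    /**
     * Calls connect until it succeeds or maxAttempts is reached, sleeping
     * between attempts and doubling the delay after each failure.
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T withBackoff(Callable<T> connect, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait before the next attempt; an interrupt propagates to the caller.
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

The doubling delay is just one common choice to avoid hammering a broker that is already struggling; a fixed delay or a capped delay would work in the same structure.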

TheJeff
  • Despite the fact that this answer does not resolve the problem, you earn points, in recognition of the attempt. – snieguu Jan 25 '20 at 19:20