
Background:
A C# WPF application talks to a Java server running on Linux via ActiveMQ/JSON.
5 connections in total:
Queues: 2
Topics: 3 (1 producer, 2 consumers)

Problem:
Under heavy use (a send/receive throughput of around 200 messages in under 500 ms, with a memory working set of around 1-1.2 GB), the client throws ‘An established connection was aborted by the software in your host machine’.

Sample stack trace:

Apache.NMS.NMSException: Unable to read data from the transport connection: An established connection was aborted by the software in your host machine. ---> System.IO.IOException: Unable to read data from the transport connection: An established connection was aborted by the software in your host machine. ---> System.Net.Sockets.SocketException: An established connection was aborted by the software in your host machine
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.IO.BufferedStream.Read(Byte[] array, Int32 offset, Int32 count)
   at System.IO.BinaryReader.FillBuffer(Int32 numBytes)
   at System.IO.BinaryReader.ReadInt32()
   at Apache.NMS.Util.EndianBinaryReader.ReadInt32()
   at Apache.NMS.ActiveMQ.OpenWire.OpenWireFormat.Unmarshal(BinaryReader dis)
   at Apache.NMS.ActiveMQ.Transport.Tcp.TcpTransport.ReadLoop()

Tried so far:

  • Switched off inactivity monitoring to reduce traffic across the 5 connections, mainly because the application has its own heartbeat implementation.
  • Set ConnectionFactory.OptimizeAcknowledge to true to batch acknowledgements (see the configuration sketch after this list).
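
For reference, a minimal sketch of how these settings, together with a failover transport for automatic reconnects, might be wired up in Apache.NMS.ActiveMQ. The broker address is a placeholder, and the exact URI option names are assumptions that may vary by NMS version:

```csharp
using System;
using Apache.NMS;
using Apache.NMS.ActiveMQ;

public static class ConnectionSetup
{
    public static IConnection Connect()
    {
        // failover: lets NMS retry/reconnect automatically when the socket drops.
        // transport.useInactivityMonitor=false disables the inactivity monitor,
        // since the application has its own heartbeat. (Option name is an
        // assumption; some NMS versions use wireFormat.maxInactivityDuration=0.)
        var uri = new Uri(
            "failover:(tcp://broker-host:61616" +
            "?transport.useInactivityMonitor=false)");

        var factory = new ConnectionFactory(uri);

        // Batch acknowledgements instead of acking each message individually.
        factory.OptimizeAcknowledge = true;

        IConnection connection = factory.CreateConnection();
        connection.Start();
        return connection;
    }
}
```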

Comments:
  • This seems like something to be expected in any networked scenario: for a variety of reasons (e.g. the server being too busy to respond in time, a client buffer too full to acknowledge in time, ...), connections get dropped. Generally, if the disconnected endpoint is not offline, a retry, sometimes with an exponential backoff delay to provide breathing room, will resolve this. I would assume this is something ActiveMQ does/can do. – Alex Apr 27 '15 at 17:12
  • Thanks for that, Alex. I have already addressed the server-too-busy scenario by increasing the timeout, and the client-buffer-full scenario by batching the acknowledgements. When this particular error is thrown, the endpoint is always online, and I do have reconnect logic in place that works quite well (it reconnects in a few seconds). The annoying thing is that it interrupts any user-driven interaction at that point, which can only be resumed manually! – Rookie Apr 28 '15 at 09:22
  • While digging through the Apache.NMS.ActiveMQ source code, I noticed that KeepAliveInfo messages keep flowing (server --> client --> server) even after switching off inactivity monitoring, as they are used at the TCP level. I will try setting keepAlive=false on the server broker and see whether that reduces any load (an extra 10 messages every 20 seconds). – Rookie Apr 28 '15 at 09:31
  • Increasing buffers / reducing traffic is likely only going to delay **when** this problem occurs as the load grows. If an essentially recoverable, temporary connection drop forces user interaction to be resumed via some form of manual intervention (I am assuming by moving a message from the dead-letter queue back into the processing queue), that is a design problem you will have to solve. – Alex Apr 28 '15 at 16:22
  • I see your point about a redesign. An interruption doesn't actually require admin-style intervention to resume an ongoing workflow, as it resumes automatically from where it last left off, but it does stop the user from making any new requests, which can't exactly be automated. As for the disconnection, I am wondering whether it can be anticipated somehow, in which case the flow rate could be throttled dynamically: some logic could track throughput and throttle the flow when it approaches a predefined threshold determined via capacity testing (see the sketch after these comments). – Rookie Apr 29 '15 at 11:37
  • Yes, you can (and should) monitor the loads (bandwidth, CPU, memory, GC, latencies, ...) so that you know what you can handle without problems, and you could throttle or try to scale up dynamically. That might reduce the frequency of dropped connections due to load problems. However, that still does not remove the design problem, i.e. correct behavior when something fails. In a networked scenario you should expect failure and be resilient against it. – Alex Apr 29 '15 at 15:05
  • A strategy used at Netflix to ensure the design is resilient against failure is to fail often and **on purpose**. Have a look at [**Chaos Monkey**](http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html), which they use to purposefully kill services "randomly" **in production**. Also, don't take my remark regarding "correct behavior" in the previous comment too literally; I was not implying that the current behavior in your system is incorrect. If it is simply a user inconvenience, as you pointed out, this may be perfectly acceptable as long as it does not occur frequently. – Alex Apr 29 '15 at 15:06
  • It happens around 2-3 times a day, and each time it recovers in less than 30 seconds, so in that sense it has been acceptable so far! I am trying to gather as much evidence as possible about memory/CPU usage when it occurs before implementing a dynamic-throttling solution. Thanks for all your input, Alex. Much appreciated! – Rookie Apr 30 '15 at 10:14
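
To make the throttling idea from the comments concrete, here is a minimal sliding-window sketch. SendThrottle is a hypothetical helper, and the 200-messages-per-500-ms threshold simply mirrors the load figures from the question; a real threshold would come from capacity testing:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public sealed class SendThrottle
{
    private readonly int _maxMessages;
    private readonly TimeSpan _window;
    private readonly Queue<DateTime> _sendTimes = new Queue<DateTime>();
    private readonly object _gate = new object();

    public SendThrottle(int maxMessages, TimeSpan window)
    {
        _maxMessages = maxMessages;
        _window = window;
    }

    // Blocks until sending one more message stays under the threshold.
    public void WaitBeforeSend()
    {
        while (true)
        {
            lock (_gate)
            {
                DateTime now = DateTime.UtcNow;

                // Drop timestamps that have fallen out of the sliding window.
                while (_sendTimes.Count > 0 && now - _sendTimes.Peek() > _window)
                    _sendTimes.Dequeue();

                if (_sendTimes.Count < _maxMessages)
                {
                    _sendTimes.Enqueue(now);
                    return;
                }
            }

            Thread.Sleep(10); // brief back-off before rechecking
        }
    }
}

// Usage (threshold values are assumptions):
//   var throttle = new SendThrottle(200, TimeSpan.FromMilliseconds(500));
//   throttle.WaitBeforeSend();
//   producer.Send(message);
```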
