
I had a terrible night trying to figure out what is going on with RabbitMQ and Spring XD, unfortunately without success.

The problem: SpringXD closes RabbitMQ connections repeatedly, or reports warnings related to the channel cache size.

Fragment from the SpringXD log (during stream initialization/autowiring):

 2016-05-03T07:42:43+0200 1.3.0.RELEASE WARN
 DeploymentsPathChildrenCache-0 listener.SimpleMessageListenerContainer
 - CachingConnectionFactory's channelCacheSize can not be less than the 
 number of concurrentConsumers so it was reset to match: 4

...

 2016-05-03T07:54:17+0200 1.3.0.RELEASE ERROR AMQP Connection
 192.168.120.125:5672 connection.CachingConnectionFactory - Channel shutdown: connection error

 2016-05-03T17:38:58+0200 1.3.0.RELEASE ERROR AMQP Connection
 192.168.120.125:5672 connection.CachingConnectionFactory - Channel shutdown: connection error; protocol method:
 method<connection.close>(reply-code=504, reply-text=CHANNEL_ERROR - 
 second 'channel.open' seen, class-id=20, method-id=10)

Fragment from the RabbitMQ log:

 =WARNING REPORT==== 3-May-2016::08:08:09 === closing AMQP connection <0.22276.61> (192.168.120.125:59350 -> 192.168.120.125:5672): client
 unexpectedly closed TCP connection

 =ERROR REPORT==== 3-May-2016::08:08:11 === closing AMQP connection <0.15409.61> (192.168.120.125:58527 -> 192.168.120.125:5672):
 {writer,send_failed,{error,closed}}

The `state: blocked` error is rare:

 =ERROR REPORT==== 3-May-2016::17:38:58 === Error on AMQP connection <0.20542.25> (192.168.120.125:59421 -> 192.168.120.125:5672, vhost:
'/', user: 'xd', state: blocked), channel 7: operation channel.open
caused a connection exception channel_error: "second 'channel.open'
 seen"

My setup (6 nodes):

- Spring XD 1.3.0 distributed (Zookeeper)
- RabbitMQ 3.6.0, Erlang R16B03-1 cluster

Spring XD rabbit bus properties:


    ackMode:                   AUTO ## or NONE
    autoBindDLQ:               false
    backOffInitialInterval:    1000
    backOffMaxInterval:        10000
    backOffMultiplier:         2.0
    batchBufferLimit:          10000
    batchingEnabled:           false
    batchSize:                 200
    batchTimeout:              5000
    compress:                  false
    concurrency:               4
    deliveryMode:              NON_PERSISTENT ## or PERSISTENT
    durableSubscription:       false
    maxAttempts:               10
    maxConcurrency:            10
    prefix:                    xdbus.
    prefetch:                  1000
    replyHeaderPatterns:       STANDARD_REPLY_HEADERS,*
    republishToDLQ:            false
    requestHeaderPatterns:     STANDARD_REQUEST_HEADERS,*
    requeue:                   true
    transacted:                false
    txSize:                    1000

    spring:
      rabbitmq:
        addresses: priv1:5672,priv2:5672,priv3:5672,priv4:5672,priv5:5672,priv6:5672
        adminAddresses: http://priv1:15672,http://priv2:15672,http://priv3:15672,http://priv4:15672,http://priv5:15672,http://priv6:15672
        nodes: rabbit@priv1,rabbit@priv2,rabbit@priv3,rabbit@priv4,rabbit@priv5,rabbit@priv6
        username: xd
        password: xxxx
        virtual_host: /
        useSSL: false

ha-xdbus policy:

 - pattern: `^xdbus\.` (applied to: all)
 - ha-mode: exactly
 - ha-params: 2
 - queue-master-locator: min-masters

RabbitMQ configuration:

[
 {rabbit, 
[
     {tcp_listeners, [5672]},
     {queue_master_locator, "min-masters"}
]
}
].

When ackMode is NONE, the following happens:

Eventually the number of consumers drops to zero and I am left with zombie streams that never recover from that state, which in turn causes unwanted queueing.

When ackMode is AUTO, the following happens:

Some messages are left un-acked forever.

Spring XD streams and durable queues

The rabbit module is used as a source or sink; there is no custom autowiring.

Typical stream definitions are as follows:

Ingestion:

event_generator | rabbit --mappedRequestHeaders=XDRoutingKey --routingKey='headers[''XDRoutingKey'']'

Processing/Sink:

rabbit --queues='xdbus.INQUEUE-A' | ENRICHMENT-PROCESSOR-A | elastic-sink
rabbit --queues='xdbus.INQUEUE-B' | ENRICHMENT-PROCESSOR-B | elastic-sink

The xdbus.INQUEUE-xxx queues are created manually (as durable) from the RabbitMQ admin GUI.

GLOBAL statistics (from the RabbitMQ admin):

  • Connections: 190
  • Channels: 2263 (channel cache problem, perhaps?)
  • Exchanges: 20
  • Queues: 120
  • Consumers: 1850

Finally:

I would appreciate it if someone could tell me what is wrong with the configuration. (I am pretty sure the network is performing well, so there are no network problems, and there is no issue with the max-open-files limit.)

Message rates vary from 2K/sec to at most 30K/sec, which is a relatively small load.

Thanks!

Ivan


1 Answer

We have seen some similar instability when churning channels at a high rate.

The work-around was to increase the channel cache size to avoid the high rate of churning; it's not clear where the instability lies, but I don't believe it is in Spring AMQP.
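The effect can be sketched with a toy model in plain Java (this is not Spring AMQP code; `ToyChannelCache` and its counters are invented for illustration). When the cache is smaller than the number of concurrent publishers, almost every publish opens and closes a physical channel; a sufficiently large cache reuses them:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a channel cache (NOT the real Spring AMQP implementation).
// A checkout with no idle channel opens a fresh one; a checkin that doesn't
// fit in the cache closes the channel. Those open/close cycles are the
// "churn" that a larger cache avoids.
public class ToyChannelCache {
    private final int cacheSize;
    private final Deque<Object> idle = new ArrayDeque<>();
    public int created = 0;   // physical channel.open count
    public int destroyed = 0; // physical channel.close count

    public ToyChannelCache(int cacheSize) { this.cacheSize = cacheSize; }

    public Object checkout() {
        Object ch = idle.poll();
        if (ch == null) { created++; ch = new Object(); }
        return ch;
    }

    public void checkin(Object ch) {
        if (idle.size() < cacheSize) {
            idle.push(ch);   // keep for reuse
        } else {
            destroyed++;     // cache full: physically close the channel
        }
    }

    public static void main(String[] args) {
        for (int size : new int[] {1, 100}) {
            ToyChannelCache cache = new ToyChannelCache(size);
            // 1000 publish "rounds"; in each round 10 threads hold a
            // channel at the same time (worst-case interleaving).
            for (int round = 0; round < 1000; round++) {
                Object[] inFlight = new Object[10];
                for (int t = 0; t < 10; t++) inFlight[t] = cache.checkout();
                for (int t = 0; t < 10; t++) cache.checkin(inFlight[t]);
            }
            System.out.println("cacheSize=" + size
                    + ": opened=" + cache.created
                    + ", closed=" + cache.destroyed);
        }
    }
}
```

In this model, a cache of 1 with 10 concurrent publishers opens and closes roughly 9,000 channels over 10,000 publishes, while a cache of 100 opens 10 channels once and reuses them; the broker-side cost of the former is the churn described above.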

One problem, however, is that XD doesn't expose channelCacheSize as a property.

The answer at the link above has a work-around to add the property by replacing the bus configuration XML. Increasing the cache size solved that user's problem.

We have an open JIRA issue to expose the property but it's not implemented yet.

I see you originally posted this as an 'answer' to that question.

Could someone be more specific and explain where exactly rabbit-bus.xml should be installed, and why is this happening anyway?

As it says there, you need to put it under the xd config directory:

xd/config/META-INF/spring-xd/bus/rabbit-bus.xml.

EDIT

Here is a technique using the bus extension mechanism instead:

$ cat xd/config/META-INF/spring-xd/bus/ext/cf.xml 

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:context="http://www.springframework.org/schema/context"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd">

    <bean id="rabbitConnectionFactory" class="org.springframework.amqp.rabbit.connection.CachingConnectionFactory">
        <constructor-arg ref="rabbitFactory" />
        <property name="addresses" value="${spring.rabbitmq.addresses}" />
        <property name="username" value="${spring.rabbitmq.username}" />
        <property name="password" value="${spring.rabbitmq.password}" />
        <property name="virtualHost" value="${spring.rabbitmq.virtual_host}" />
        <property name="channelCacheSize" value="${spring.rabbitmq.channelCacheSize:100}" />
    </bean>

</beans>
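With the `${spring.rabbitmq.channelCacheSize:100}` placeholder above, the cache size can then be tuned like the other spring.rabbitmq properties; as discussed in the comments, a fragment along these lines in servers.yml would do it (200 is just an example value; if the property is omitted, the placeholder's `:100` default applies):

```yaml
spring:
  rabbitmq:
    channelCacheSize: 200
```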

EDIT: TEST RESULTS

Prepopulated queue foo with 1 million messages.

    concurrency:               10
    prefetch:                  1000
    txSize:                    1000

xd:>stream create foo --definition "rin:rabbit --concurrency=10 --maxConcurrency=10 --prefetch=1000 --txSize=1000 | t1:transform | t2:transform | rout:rabbit --routingKey='''bar'''" --deploy
Created and deployed new stream 'foo'

So with this configuration, we end up with 40 consumers.

I never saw more than 29 publishing channels from the bus, there were 10 publishers for the sink.

1m messages were transferred from foo to bar in less than 5 minutes (via xdbus.foo.0, xdbus.foo.1 and xdbus.foo.2) - 4m messages published.

No errors - but my laptop needs to cool off :D

– Gary Russell
  • The question is where the xd config directory is ($SPRING-XD-ROOT/xd/config?). Instead I tried this, but it didn't help:
    `if (this.getConnectionFactory() instanceof CachingConnectionFactory) { CachingConnectionFactory cf = (CachingConnectionFactory) getConnectionFactory(); if (cf.getCacheMode() == CacheMode.CHANNEL && cf.getChannelCacheSize() < this.concurrentConsumers) { cf.setChannelCacheSize(200); /*cf.setChannelCacheSize(this.concurrentConsumers);*/`
    – Ivan Prostran May 03 '16 at 22:44
  • No, you can't use logic like that; you need to put that exact file from the answer in the exact spot I describe above - `/xd/config/META-INF/spring-xd/bus/rabbit-bus.xml`. It **replaces** the existing file in the xd-dirt jar - that location (`/xd/config`) is earlier on the classpath so dirt will pick up the modified version of the XML configuration for the bus. – Gary Russell May 03 '16 at 22:45
  • Note that this will increase the cache size for the rabbit message bus; for the rabbit sink, you will need to add `channel-cache-size` to the connection factory element in the sink's rabbit.xml. – Gary Russell May 04 '16 at 14:43
  • I wonder why the rabbit sink module needs the `channel-cache-size` as you stated earlier, and the rabbit source does not. If I am not wrong, both inject the rabbitConnectionFactory reference, and that reference is the CachingConnectionFactory instance. Anyway, many thanks for your assistance. Of course, I'll let you know when I'm done with the test. – Ivan Prostran May 04 '16 at 18:02
  • The source doesn't need it because the channel(s) in the source are long lived, and are never returned to the cache anyway. With the sink, which uses a different connection factory to the bus, channels are (very) short lived during the publish operation so, in a high-volume multi-threaded environment you can churn channels in the sink too (in the same way they are churned in the bus for inter-module communication). – Gary Russell May 04 '16 at 18:09
  • But what is actually going on with the sink and short-lived channels? Spring XD is a heavily multi-threaded environment, and channels should not be shared between threads (fine). So the channel cache is a place for channel sharing, etc. If the cache size is 1, one thread produces at a time through that particular cached channel. Despite the fact that such a setup performs badly and is not optimal (multi-core) because producer threads are serialized, what could possibly go wrong with that underlying connection and channel? – Ivan Prostran May 04 '16 at 19:06
  • I'm sorry, I don't have an answer for why churning channels at a high rate causes problems - just that it does and caching the channels for reuse avoids it. I have written a test case but haven't been able to reproduce it and when I crank it up to a higher rate, my RabbitMQ server goes into a tail spin (with a lot of swapping). So, I think the underlying root cause is going to be hard to figure out but, clearly, creating and destroying channels at that rate is not good; hence the need to cache and reuse the channels when publishing in the sink (as well as publishing to the bus). – Gary Russell May 04 '16 at 19:13
  • Just to close this out - the same test (1000 threads publishing 1000 messages each) works just fine when the channels are cached. Without caching, the broker is cratered. – Gary Russell May 04 '16 at 20:26
  • Yes, exactly that - the broker was burned! Concurrency level 1 is also a temporary solution (i.e. a workaround); the cluster is stable now. – Ivan Prostran May 05 '16 at 09:43
  • Just to be sure: I added rabbit-bus.xml at /opt/pivotal/spring-xd/xd/config/META-INF/spring-xd/bus/rabbit-bus.xml, added channel-cache-size="${spring.rabbitmq.channelCacheSize:200}" to /opt/pivotal/spring-xd/xd/modules/sink/rabbit/config/rabbit.xml, and added spring: rabbitmq: channelCacheSize: 200 to server.yml – Ivan Prostran May 05 '16 at 14:57
  • 2016-05-05T17:09:32+0200 1.3.0.RELEASE WARN DeploymentsPathChildrenCache-0 listener.SimpleMessageListenerContainer - CachingConnectionFactory's channelCacheSize can not be less than the number of concurrentConsumers so it was reset to match: 10 (concurrency is 10... am I doing something wrong?) – Ivan Prostran May 05 '16 at 15:15
  • I believe that's just a warning from the `rabbit` source - in fact, that message/procedure is out of date and we should remove it ([JIRA](https://jira.spring.io/browse/AMQP-607)) - the listener container channels are NEVER cached any more (they are long-lived) so there is no purpose in increasing the cache size on the source. If you want to eliminate the message, increase the cache size in the source `rabbit.xml` but, since there is no caching done on the source side, that's just noise. – Gary Russell May 05 '16 at 16:12
  • I understand that, but something is still misconfigured. Check this out: `cat container-1-start.log | grep "com.rabbitmq.client.ShutdownSignalException: connection error" | wc -l` returns 571 (and this is just one node of 6 in the cluster). – Ivan Prostran May 05 '16 at 17:21
  • Generally the number of consumers is decreasing. Sometimes the number of consumers is reduced to zero, and I have zombie streams (e.g. src|rabbit): the source publishes but rabbit doesn't consume. Something is very wrong here. – Ivan Prostran May 05 '16 at 17:28
  • In my opinion there is nothing exotic here, just a high message rate. I have event generators that use routing through the default exchange. A typical stream is as follows: event_generator | rabbit --mappedRequestHeaders=XDRoutingKey --routingKey='headers[''XDRoutingKey'']' – Ivan Prostran May 05 '16 at 17:33
  • I just tested it again (confirming the cache size is set with a debugger) and it worked fine for me. However, while debugging a mistake in my test (wrong) directory, I found a better technique - see my edit - put just the connection factories in a bus extension xml - any filename ending with `.xml` in `xd/config/META-INF/spring-xd/bus/ext/` will be found. – Gary Russell May 05 '16 at 19:25
  • channel-cache-size="${spring.rabbitmq.channelCacheSize:200}" to /opt/pivotal/spring-xd/xd/modules/sink/rabbit/config/rabbit.xml -- remains the same ? (along with ext/cf.xml) – Ivan Prostran May 05 '16 at 20:01
  • Yes, this is just for the bus. It turns out that with this technique, you only need to override the `CachingConnectionFactory`. – Gary Russell May 05 '16 at 20:24
  • Massive exceptions again, 2016-05-05T22:45:28+0200 1.3.0.RELEASE ERROR AMQP Connection 192.168.120.124:5672 connection.CachingConnectionFactory - Channel shutdown: connection error – Ivan Prostran May 05 '16 at 20:52
  • Are you seeing a huge number of channels in the Management UI? With the cache properly configured you shouldn't see more than 2x the total number of consumer threads and generally much less than that because publishing is such a short time event. There has to be something else going on. – Gary Russell May 05 '16 at 20:59
  • Consumers 2299, channels 4273. Actually, I've noticed that consumers are becoming dead... that's weird. For example: 15000 ready messages, the consumer is sleeping, and queue utilization is 0%. WTF?? Gary, thanks for your time; it seems I'll have to dig deeper and figure out some radical workaround until this mystery is solved. – Ivan Prostran May 05 '16 at 21:09
  • Another interesting thing: when I set ackMode=NONE and deliveryMode=NON_PERSISTENT, CPU utilization is drastically lower and everything seems to work very smoothly. Unfortunately there is a catch: after a while, dead streams appear. It is very strange, but after undeploying/deploying the stream, the data flow is alive again. So the producers seem to be OK; it is very clear - the queue is growing. The other side of the pipe becomes dead --> event_generator|rabbit-sink (the rabbit sink's internal consumer somehow dies). The queue is something like stream_name.0 (very simple). – Ivan Prostran May 05 '16 at 21:50
  • Weird - I just added some results from a test I just ran with no errors, in case that helps. – Gary Russell May 05 '16 at 21:52
  • I can tell that ackMode=NONE causes consumer instability (i.e. the consumer count eventually drops to 0). ackMode=AUTO works much better, and I don't see any zombie streams (left without consumers), which cause queue growth and finally cluster node eviction. Again there is a catch: there are many UNACKED messages left. Thanks for the test!! – Ivan Prostran May 06 '16 at 07:12