- Spring Cloud Data Flow (SCDF) Server for Cloud Foundry v1.4.x
- RabbitMQ service tile provisioned for message transport
A deployed Spring Cloud Data Flow stream has a processor that can produce outgoing messages faster than a downstream processor or sink can consume them. This creates a bottleneck in the RabbitMQ transport that eventually results in lost messages.
In our private cloud environment, our Rabbit service tile has default settings of `max-length=1000` and `max-length-bytes=1000000`. We are currently unable to modify these settings to increase either of these capacities.
We have tried setting the `prefetch` value on the consuming application (I believe the setting would be `deployer.<appname>.rabbit.bindings.consumer.prefetch=10000`), which effectively increases the consuming application's ability to pull in more messages in a shorter period of time, but this only helps in limited scenarios. If an extremely large volume of data goes through the stream, we are still likely to hit a limit where messages get dropped. In the example above, setting the prefetch appears to raise the consuming application's capacity from 1000 to 11000 messages.
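For reference, this is roughly how we would pass that kind of property at deployment time. This is only a sketch; `mystream`, `myprocessor`, and the `input` channel name are placeholders, and the exact property path (the `app.` application-property form here versus the `deployer.` form above) may differ depending on the SCDF and Rabbit binder versions in use:

```
dataflow:> stream deploy mystream --properties "app.myprocessor.spring.cloud.stream.rabbit.bindings.input.consumer.prefetch=10000"
```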
We have also attempted to make use of an auto-scaling service so we can increase the number of active instances of the consuming application, which obviously also increases its capacity. However, this feels like putting a band-aid on the problem rather than using infrastructure and/or services that can inherently handle the expected volume in an elastic manner. What if we do not know the specific times of day when volumes will increase significantly, and what if the volume ramps up so quickly that the CPU thresholds in an auto-scaler cannot add instances fast enough to avoid lost messages?
- We have not tried setting the RabbitMQ service to guarantee delivery. Based on the documentation, it seems easier to tell when a message was not delivered than to make delivery a certainty. We do not know whether this is a viable option and are looking for advice (see the first sketch after this list).
- We have not tried implementing any throttling in our stream apps themselves. We do not know whether this is a viable option and are looking for advice.
- We have not tried binding apps to a DLQ or re-queueing messages that fail processing. We do not know whether this is a viable option and are looking for advice (see the second sketch after this list).
- We have not tried binding the SCDF server to our own Rabbit service instance outside the Cloud Foundry service tiles. This would theoretically be a RabbitMQ instance we have more control over, where we could raise the queue depth and byte size limits to handle our expected loads more easily (see the third sketch after this list).
- We have not tried an alternative transport mechanism such as Kafka. We do not know whether this is a viable option and are looking for advice.
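On guaranteed delivery: as far as I can tell, the closest application-level option is enabling publisher confirms/returns on the producing apps and routing failed sends to an error channel, which tells you when a message was not accepted rather than guaranteeing it will be. A rough sketch of the deployment properties we would try, assuming Boot 1.5/2.0-era property names and placeholder app (`myprocessor`) and channel (`output`) names:

```
# hypothetical deployment.properties, passed with: stream deploy mystream --propertiesFile <path>
app.myprocessor.spring.rabbitmq.publisher-confirms=true
app.myprocessor.spring.rabbitmq.publisher-returns=true
# route failed/returned sends to the binding's error channel so the app can react
app.myprocessor.spring.cloud.stream.bindings.output.producer.errorChannelEnabled=true
```

On the DLQ option: the Rabbit binder can auto-bind a dead-letter queue per consumer binding, so messages that fail processing (and, if a dead-letter exchange is configured, messages dropped from the head of a full queue) land somewhere recoverable instead of disappearing. A sketch with placeholder names (`mysink`, `input` channel); exact property names may vary by binder version:

```
# hypothetical deployment.properties for the consuming app
app.mysink.spring.cloud.stream.rabbit.bindings.input.consumer.autoBindDlq=true
app.mysink.spring.cloud.stream.rabbit.bindings.input.consumer.republishToDlq=true
app.mysink.spring.cloud.stream.rabbit.bindings.input.consumer.requeueRejected=false
```

On bringing our own Rabbit instance: my understanding is that the SCDF Cloud Foundry server can be pointed at a different service instance for the stream apps, and on a broker we control we could raise the limits via a policy. A sketch with hypothetical service/app/plan names (the exact environment variable, tile, and plan names depend on the SCDF server and RabbitMQ versions):

```
# create (or user-provide) a Rabbit service we control and point the SCDF server at it
cf create-service p-rabbitmq standard my-rabbit
cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_STREAM_SERVICES my-rabbit
cf restage dataflow-server

# on that broker, raise the per-queue limits with a policy
rabbitmqctl set_policy scdf-limits ".*" '{"max-length":1000000,"max-length-bytes":1000000000}' --apply-to queues
```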
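(The three sketches above correspond to the guaranteed-delivery, DLQ, and own-Rabbit-instance items, in that order.)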
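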
I find it hard to believe that others have not run into a problem of a similar nature with these streaming paradigms, and I'm curious whether there is an accepted best-practice solution, or whether we need to take a closer look at whether we are misusing the streaming paradigm in these situations.
Our basic requirement is that losing messages in any streaming application context is unacceptable, so we need to determine the best way to configure our environment, or to analyze our implementation choices, to ensure our implementations are robust and reliable under heavy load.
Any advice from the community, or from the Pivotal folks on this?