- Spring Cloud Data Flow (SCDF) Server for Cloud Foundry v1.4.x
- RabbitMQ service tile provisioned for message transport
A deployed Spring Cloud Data Flow stream has a processor that can produce outgoing messages faster than a downstream processor or sink can consume them. This creates a bottleneck in the RabbitMQ transport that eventually results in lost messages.
In our private cloud environment, our Rabbit service tile has default settings of `max-length=1000` and `max-length-bytes=1000000`. We are currently unable to modify these settings to increase either of these capacities.
We have tried setting the `prefetch` value on the consuming application (I believe the setting would be `deployer.<appname>.rabbit.bindings.consumer.prefetch=10000`), which effectively increases the consuming application's ability to pull in more messages in a shorter period of time, but this only helps in limited scenarios. If an extremely large volume of data goes through the stream, we are still likely to hit a limit where messages get dropped. In the example above, setting the prefetch appears to raise the consuming application's capacity from 1000 to 11000 messages.
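For reference, this is roughly how we would pass that kind of property at deployment time. This is only a sketch; `mystream`, `myprocessor`, and the `input` channel name are placeholders, and the exact property path (the `app.` application-property form here versus the `deployer.` form above) may differ depending on the SCDF and Rabbit binder versions in use:

```
dataflow:> stream deploy mystream --properties "app.myprocessor.spring.cloud.stream.rabbit.bindings.input.consumer.prefetch=10000"
```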
We have also attempted to make use of an auto-scaling service so we can increase the number of active instances of the consuming application, which obviously also increases its capacity. However, this feels like putting a band-aid on the problem rather than using infrastructure and/or services that can inherently handle the expected volume in an elastic manner. What if we do not know the specific times of day when volumes will increase significantly, and what if the volume ramps up so quickly that the CPU thresholds in an auto-scaler cannot add instances fast enough to avoid lost messages?
- We have not tried setting the RabbitMQ service to guarantee delivery. Based on the documentation, it seems easier to tell when a message was not delivered than to make delivery a certainty. We do not know whether this is a viable option and are looking for advice (see the first sketch after this list).
- We have not tried implementing any throttling in our stream apps themselves. We do not know whether this is a viable option and are looking for advice.
- We have not tried binding apps to a DLQ or re-queueing messages that fail processing. We do not know whether this is a viable option and are looking for advice (see the second sketch after this list).
- We have not tried binding the SCDF server to our own Rabbit service instance outside the Cloud Foundry service tiles. This would theoretically be a RabbitMQ instance we have more control over, where we could raise the queue depth and byte size limits to handle our expected loads more easily (see the third sketch after this list).
- We have not tried an alternative transport mechanism such as Kafka. We do not know whether this is a viable option and are looking for advice.
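On guaranteed delivery: as far as I can tell, the closest application-level option is enabling publisher confirms/returns on the producing apps and routing failed sends to an error channel, which tells you when a message was not accepted rather than guaranteeing it will be. A rough sketch of the deployment properties we would try, assuming Boot 1.5/2.0-era property names and placeholder app (`myprocessor`) and channel (`output`) names:

```
# hypothetical deployment.properties, passed with: stream deploy mystream --propertiesFile <path>
app.myprocessor.spring.rabbitmq.publisher-confirms=true
app.myprocessor.spring.rabbitmq.publisher-returns=true
# route failed/returned sends to the binding's error channel so the app can react
app.myprocessor.spring.cloud.stream.bindings.output.producer.errorChannelEnabled=true
```

On the DLQ option: the Rabbit binder can auto-bind a dead-letter queue per consumer binding, so messages that fail processing (and, if a dead-letter exchange is configured, messages dropped from the head of a full queue) land somewhere recoverable instead of disappearing. A sketch with placeholder names (`mysink`, `input` channel); exact property names may vary by binder version:

```
# hypothetical deployment.properties for the consuming app
app.mysink.spring.cloud.stream.rabbit.bindings.input.consumer.autoBindDlq=true
app.mysink.spring.cloud.stream.rabbit.bindings.input.consumer.republishToDlq=true
app.mysink.spring.cloud.stream.rabbit.bindings.input.consumer.requeueRejected=false
```

On bringing our own Rabbit instance: my understanding is that the SCDF Cloud Foundry server can be pointed at a different service instance for the stream apps, and on a broker we control we could raise the limits via a policy. A sketch with hypothetical service/app/plan names (the exact environment variable, tile, and plan names depend on the SCDF server and RabbitMQ versions):

```
# create (or user-provide) a Rabbit service we control and point the SCDF server at it
cf create-service p-rabbitmq standard my-rabbit
cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_STREAM_SERVICES my-rabbit
cf restage dataflow-server

# on that broker, raise the per-queue limits with a policy
rabbitmqctl set_policy scdf-limits ".*" '{"max-length":1000000,"max-length-bytes":1000000000}' --apply-to queues
```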
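(The three sketches above correspond to the guaranteed-delivery, DLQ, and own-Rabbit-instance items, in that order.)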
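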
I find it hard to believe that others have not run into a problem of a similar nature with these streaming paradigms, and I'm curious whether there is an accepted best-practice solution, or whether we need to take a closer look at whether we are misusing the streaming paradigm in these situations.
Our basic requirement is that losing messages in any streaming application context is unacceptable, so we need to determine the best way to configure our environment, or to analyze our implementation choices, to ensure our implementations are robust and reliable under heavy load.
Any advice from the community, or from the Pivotal folks on this?