
I'm new to distributed stream processing (Spark). I've read some tutorials/examples which cover how backpressure results in the producer(s) slowing down in response to overloaded consumers. The classic example given is ingesting and analyzing tweets. When there is an unexpected spike in traffic such that the consumers are unable to handle the load, they apply backpressure and the producer responds by adjusting its rate lower.

What I don't really see covered is what approaches are used in practice to deal with the massive amount of incoming real-time data that cannot be processed immediately because the pipeline as a whole cannot keep up.

I imagine the answer to this is business domain dependent. For some problems it might be fine to just drop that data, but in this question I would like to focus on a case where we don't want to lose any data.

Since I will be working in an AWS environment, my first thought would be to "buffer" the excess data in an SQS queue or a Kinesis stream. Is it as simple as this in practice, or is there a more standard streaming solution to this problem (perhaps as part of Spark itself)?
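To make the idea concrete, this is roughly what I mean by "buffering" (the stream name, region, and event shape below are made up; the analysis side would then consume from the stream at whatever rate it can sustain):

```python
# Sketch of the ingest side writing raw events into a Kinesis stream instead
# of processing them inline. Stream name and partition key are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def buffer_event(event: dict) -> None:
    """Write one raw event to the stream so consumers can pick it up later."""
    kinesis.put_record(
        StreamName="tweet-ingest",                      # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("user_id", "unknown")),
    )
```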

andrasp

1 Answer


"Is there a more standard streaming solution?" - Maybe. There are a lot of different ways to do this, not immediately clear if there is a "standard" yet. This is just an opinion though, and you're not likely to get a concrete answer for this part.

"Is it as simple as this in practice?" - SQS and Kinesis have different usage patterns:

  • Use SQS if you want to always process all messages, AND have a single logical consumer
    • think of this like a classic queue where messages need to be "consumed" from the queue.
    • definitely a simpler model to understand and get going with, but it essentially acts as a buffer
  • Use Kinesis if you want to easily skip messages, OR have multiple logical consumers (both consumption patterns are sketched after this list)
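A rough sketch of the two consumption models with boto3 (queue URL and stream name are made up). With SQS, a message is received, handled, then deleted; once any consumer deletes it, it's gone, which is why a single logical consumer fits best. With Kinesis, nothing is deleted on read; each consumer tracks its own position, so multiple independent readers or re-reading from an earlier position are natural:

```python
import boto3

def process(payload):
    """Placeholder for whatever analysis the pipeline actually does."""
    print(payload)

sqs = boto3.client("sqs", region_name="us-east-1")
kinesis = boto3.client("kinesis", region_name="us-east-1")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tweets"  # hypothetical

# SQS: queue semantics. Receive, handle, then delete the message.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

# Kinesis: log semantics. Reading advances this consumer's position only;
# the records stay in the stream for other consumers (or a re-read).
shard_iterator = kinesis.get_shard_iterator(
    StreamName="tweet-ingest",                # hypothetical stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",         # start from the oldest retained record
)["ShardIterator"]

batch = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
for rec in batch["Records"]:
    process(rec["Data"])
shard_iterator = batch["NextShardIterator"]   # position advances; the data remains
```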

For your use case where you have a "massive amount of incoming real-time data which cannot be immediately processed", I'd focus your efforts on Kinesis over SQS, as the Kinesis model also aligns better with other streaming mechanisms like Spark / Kafka.
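As one illustration of that alignment, a Spark Streaming job can read the buffered records straight out of Kinesis. This is only a sketch: the stream and application names are hypothetical, and it assumes the spark-streaming-kinesis-asl package is available to the job. Spark's own backpressure setting then throttles how fast each micro-batch pulls from the buffer.

```python
# Submit with something like:
#   spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.3.0 job.py
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

conf = (SparkConf()
        .setAppName("tweet-analysis")
        # Let Spark throttle the receive rate when batches start falling behind.
        .set("spark.streaming.backpressure.enabled", "true"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

records = KinesisUtils.createStream(
    ssc,
    kinesisAppName="tweet-analysis",      # also names the DynamoDB checkpoint table
    streamName="tweet-ingest",            # hypothetical stream acting as the buffer
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=10,
)

records.count().pprint()  # stand-in for the real per-batch analysis
ssc.start()
ssc.awaitTermination()
```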

Krease
  • Thank you for the good information on Kinesis vs SQS. For an answer I was hoping for a concrete example of how this problem is being solved in practice. – andrasp Apr 10 '18 at 01:09
  • I've done it probably 5 different ways for different projects I've worked on with different requirements. The simplest way is to model the gap between producer and consumer (based on metrics such as how many items are in the queue, or how far behind 'current' you are in the stream), and adjust either producer or consumer behavior based on the value of that metric (autoscale consumers, or produce a different type of item or fewer items). If you can describe the behavior, you can model it and code it. – Krease Apr 10 '18 at 02:52
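A minimal version of the metric-driven loop Krease describes might look like the following. The queue URL and the scaling hook are hypothetical, and in a real deployment this is more often a CloudWatch alarm driving an auto-scaling policy than hand-rolled code:

```python
# Poll a lag metric (here: SQS queue depth) and nudge consumer capacity.
import time
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tweets"  # hypothetical

def backlog_depth() -> int:
    """How far behind the consumers are, measured as messages waiting in the queue."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

def scale_consumers(count: int) -> None:
    """Hypothetical hook: set the desired consumer count (e.g. an ASG or ECS service)."""
    print(f"scaling consumer fleet to {count}")

while True:
    depth = backlog_depth()
    if depth > 10_000:        # far behind: add consumers
        scale_consumers(8)
    elif depth < 1_000:       # caught up: shrink back down
        scale_consumers(2)
    time.sleep(60)
```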