I need to process, at peak, hundreds of records per second. The records are simple JSON bodies; they should be collected and then processed/transformed and loaded into a database.

A few questions ...

1) Is Kinesis right for this? Or is SQS better suited?

2) When using Kinesis, should I use the Python examples shown here: https://aws.amazon.com/blogs/big-data/snakes-in-the-stream-feeding-and-eating-amazon-kinesis-streams-with-python/ or should I implement my producer and consumer with the KCL? What's the difference?

3) Does Kinesis offer anything for managing the consumers, or do I just run them on EC2 instances and manage them myself?

4) What is the correct pattern for accessing data? I can't afford to miss any records, so I assume I would be fetching records from "TRIM_HORIZON" and not "LATEST". If so, how do I manage duplicates? In other words, how do my consumers get records from the stream, handle consumers going down, etc., and always know they are fetching all the records?

Thanks!

  • What kind of processing do you plan to do? Do you care about messages maintaining their order? – ketan vijayvargiya Feb 08 '17 at 09:11
  • Hey - messages don't have to maintain order, and the only processing the consumer will do is transform records into a different format and forward them to another service. – mr-sk Feb 08 '17 at 14:40

1 Answer

  1. Kinesis is more useful for streaming data or when you require strict ordering between messages. Your use case, on the other hand, looks more like a buffering solution between two services, so I'd prefer SQS over Kinesis. SQS is also cheaper and simpler to work with, and it should easily handle your required scale (a rough boto3 sketch follows after this list).
  2. The example you shared uses the low-level Kinesis APIs. You should prefer the KPL and KCL for implementing your producers and consumers respectively, as they provide higher-level constructs that are easier to use (see the KCL processor sketch below).
  3. You can run both Kinesis and SQS producers and consumers on EC2 or on Lambda. In the latter case, AWS takes care of the hardware management for you.
  4. Yes, you should go with TRIM_HORIZON. If there are duplicates in your data, your consumers should handle them by doing some bookkeeping of their own. As for consumers going down and so on, the KCL handles those cases gracefully (the low-level sketch at the end of this answer shows what that position tracking looks like without it).
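
To make point 1 concrete, here is a minimal producer/consumer sketch using boto3 and a standard SQS queue. The region, queue URL, and message shape are placeholders, not anything from the question.

```python
# Hedged sketch: buffering JSON records through SQS with boto3.
# The queue URL and region below are placeholders.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/records-queue"

def produce(record):
    """Send one JSON record to the queue."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(record))

def consume_once():
    """Long-poll for a batch, transform each message, then delete it."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,  # maximum batch size SQS allows per receive
        WaitTimeSeconds=20,      # long polling cuts down on empty responses
    )
    for msg in resp.get("Messages", []):
        record = json.loads(msg["Body"])
        # ... transform the record and forward it to the downstream service ...
        # Deleting only after successful processing gives at-least-once behaviour.
        sqs.delete_message(QueueUrL=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]) if False else \
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Because SQS is at-least-once, a consumer can still see a message twice if it crashes between processing and deleting, so the downstream transform should be idempotent.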
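
For point 2, a Python KCL consumer is essentially a record-processor class driven by the KCL MultiLangDaemon. The sketch below follows the older amazon_kclpy v1-style interface; method names and signatures differ between KCL versions, so treat it as a shape rather than a drop-in implementation.

```python
# Rough shape of an amazon_kclpy (v1-style) record processor.
# The MultiLangDaemon launches this script and feeds it records over stdin/stdout.
import base64
import json

from amazon_kclpy import kcl

class RecordProcessor(kcl.RecordProcessorBase):
    def initialize(self, shard_id):
        self.shard_id = shard_id

    def process_records(self, records, checkpointer):
        for record in records:
            # In the v1 interface each record's data arrives base64-encoded.
            payload = json.loads(base64.b64decode(record.get("data")))
            # ... transform and forward to the downstream service here ...
        # Checkpointing records progress in the KCL's DynamoDB lease table,
        # so a replacement worker resumes where this one left off.
        checkpointer.checkpoint()

    def shutdown(self, checkpointer, reason):
        if reason == "TERMINATE":
            # The shard is closing; checkpoint so its child shards start cleanly.
            checkpointer.checkpoint()

if __name__ == "__main__":
    kcl.KCLProcess(RecordProcessor()).run()
```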
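
For point 4, if you stay on the low-level API instead of the KCL, a single-shard reader starting at TRIM_HORIZON looks roughly like this. Shard iterators expire after a few minutes, so the value worth persisting is the last processed sequence number; where you store it (DynamoDB, a file, etc.) is up to you and is only assumed here. The stream name and single-shard setup are placeholders.

```python
# Hedged sketch: reading one shard from TRIM_HORIZON with the low-level API.
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "my-stream"

shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # oldest record still retained in the shard
)["ShardIterator"]

last_sequence_number = None  # persist this somewhere durable to survive restarts

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        # ... process record["Data"] (bytes) here ...
        last_sequence_number = record["SequenceNumber"]
    iterator = resp["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read throughput limits
```

To resume after a crash you would request a new iterator with ShardIteratorType="AFTER_SEQUENCE_NUMBER" and the persisted sequence number rather than TRIM_HORIZON; this per-shard bookkeeping is exactly what the KCL's DynamoDB checkpoints do for you.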
  • Thanks for the answer. Questions: 1) I'll reconsider SQS as the solution, thanks. 2) KPL and KCL look more "complicated" to run, with less documentation than the SDK API. Also, it looks like they only run on Red Hat/RHEL (at least from my quick read through the installation documentation). 3) Got it, that makes sense; I need to read up on that also. 4) So, if I go with TRIM_HORIZON, the consumer will start reading at the beginning of the stream... how do I mark where I am in the stream? Is that the shard_iterator that I would keep track of, or something else? – mr-sk Feb 08 '17 at 16:26
  • I don't know about the case where you use the low-level APIs, but the KCL automatically writes checkpoints to DynamoDB, so you don't have to worry about it on your own. – ketan vijayvargiya Feb 08 '17 at 16:33