
I need to build a service using AWS tools that aggregates data from several DynamoDB tables and stores it in a Redshift cluster. Each data stream also needs processing before it is stored in Redshift.

My current idea is to send each table's change stream through DynamoDB Streams into Kinesis Data Analytics, with each stream having its own Kinesis component. Each Kinesis component would process the data and then write the processed data to the same Redshift cluster.

I fear that this is not scalable, and I was wondering if there is any way to have one single service take multiple input streams, do the processing, and then send the processed data to the Redshift cluster. That way, for each new DynamoDB table or S3 bucket, we wouldn't need to create a whole new Kinesis Data Analytics application.
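A rough sketch of what I'm imagining as the "single service": one Lambda function subscribed to several DynamoDB streams, which looks at each record's source-table ARN and dispatches to a per-table transform before loading. The table names (`orders`, `users`) and transforms here are hypothetical placeholders, not my real schema:

```python
# Sketch: one Lambda handler consuming events from several DynamoDB
# streams and dispatching on the source table name. The table names
# and transform logic below are illustrative only.

def transform_orders(new_image):
    # Example processing: flatten the DynamoDB attribute-value map
    # ({"id": {"S": "o1"}} -> {"id": "o1"}).
    return {k: list(v.values())[0] for k, v in new_image.items()}

def transform_users(new_image):
    return {k: list(v.values())[0] for k, v in new_image.items()}

TRANSFORMS = {
    "orders": transform_orders,
    "users": transform_users,
}

def table_from_arn(arn):
    # Stream ARNs look like:
    # arn:aws:dynamodb:<region>:<acct>:table/<TableName>/stream/<timestamp>
    return arn.split(":table/")[1].split("/")[0]

def handler(event, context=None):
    rows = []
    for record in event["Records"]:
        table = table_from_arn(record["eventSourceARN"])
        transform = TRANSFORMS[table]
        new_image = record["dynamodb"].get("NewImage", {})
        rows.append((table, transform(new_image)))
    # In the real function these rows would be batched and written to
    # Redshift (e.g. via COPY or Kinesis Data Firehose) rather than returned.
    return rows
```

Adding a new table would then just mean registering a new transform and attaching its stream as another trigger, instead of standing up a new Kinesis component.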

For reference, the data stored in each DynamoDB table is different, and so is the processed data.

There is an extremely large amount of data involved, and updates need to be handled in real time.

szatz
  • Does it have to be realtime? – Robert Kossendey Jun 15 '21 at 06:56
  • Redshift is for when you have at *least* a terabyte of data. It gives you timely SQL on Big Data; *nothing else*. No lots of users, no fast data, no small data, *nothing*. I may be wrong, but I think, from what little I know of your use case - what you have written here - Redshift is categorically the wrong choice, and you should stick to a conventional, unsorted row-store database. –  Jun 15 '21 at 07:40
  • Thanks for your feedback, I have edited my original post to clarify my use case. I am working with extremely large data that needs to be updated in real time. The reason I am using redshift is to be able to execute queries on this data quickly, as well as the fact that the columnar storage fits my data well. I would love to hear other suggestions though as I am not experienced with AWS :-) – szatz Jun 15 '21 at 13:58
  • It seems I may be able to use KCL or a lambda function with multiple triggers to consume multiple input streams. If anyone has an opinion on this I'd love to hear! [kcl multiple streams](https://aws.amazon.com/about-aws/whats-new/2020/10/kinesis-client-library-enables-multi-stream-processing/) [lambda multiple triggers](https://docs.aws.amazon.com/lambda/latest/dg/lambda-invocation.html) – szatz Jun 15 '21 at 14:18
  • We do something very similar to this: DynamoDB -> Stream -> Lambda -> Kinesis Stream -> S3 -> Lambda -> Google BigQuery. Each Lambda function in the pipeline does some sort of processing on the data at each level. Similarly, you can opt for: DynamoDB -> Stream -> Lambda -> Redshift. You can point multiple streams to the same Lambda function as well. – pankajanand18 Jun 17 '21 at 01:06
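Following pankajanand18's comment, the fan-in step might look like the sketch below: several DynamoDB streams trigger the same Lambda, which serializes each record and forwards the batch to a single Kinesis Data Firehose delivery stream that loads into Redshift. The delivery stream name `to-redshift` is a hypothetical placeholder, and the actual `put_record_batch` call is left commented since it needs deployed AWS resources:

```python
# Sketch of a fan-in Lambda: many DynamoDB streams -> one Firehose
# delivery stream -> Redshift. Assumes a delivery stream named
# "to-redshift" (hypothetical) already configured to COPY into Redshift.
import json

def build_firehose_batch(event):
    """Turn a DynamoDB Streams event into a Firehose put_record_batch payload,
    tagging each record with its source table so Redshift can route it."""
    return [
        {"Data": (json.dumps({
            "table": r["eventSourceARN"].split(":table/")[1].split("/")[0],
            "image": r["dynamodb"].get("NewImage", {}),
        }) + "\n").encode()}
        for r in event["Records"]
    ]

def handler(event, context=None):
    records = build_firehose_batch(event)
    # In the deployed function:
    # import boto3
    # boto3.client("firehose").put_record_batch(
    #     DeliveryStreamName="to-redshift", Records=records)
    return records
```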

0 Answers