Has anyone tried to consume DynamoDB streams in Apache Flink?

Flink has a Kinesis consumer, but I am looking for how I can consume a DynamoDB stream directly.

DataStream<String> kinesis = env.addSource(new FlinkKinesisConsumer<>(
    "kinesis_stream_name", new SimpleStringSchema(), consumerConfig));

I tried searching a lot but did not find anything. However, I found an open feature request on the Flink JIRA board, so I guess this option is not available yet? What alternatives do I have?

Allow FlinkKinesisConsumer to adapt for AWS DynamoDB Streams

Simon Schnell
AWS Enthusiastic

1 Answer

UPDATED ANSWER - 2019

The FlinkKinesisConsumer connector can now process a DynamoDB stream, since this JIRA ticket (FLINK-4582) has been implemented.
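Assuming a Flink version that includes this change, the flink-connector-kinesis module ships a `FlinkDynamoDBStreamsConsumer` that reads a DynamoDB stream by its ARN. A minimal sketch (the region, initial position, and the placeholder stream ARN are assumptions; check the connector docs for your Flink version):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkDynamoDBStreamsConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class DynamoStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
        consumerConfig.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        // Unlike FlinkKinesisConsumer, the "stream name" here is the
        // DynamoDB stream ARN of the table (placeholder shown).
        DataStream<String> dynamoStream = env.addSource(new FlinkDynamoDBStreamsConsumer<>(
            "arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2019-01-01T00:00:00.000",
            new SimpleStringSchema(), consumerConfig));

        dynamoStream.print();
        env.execute("DynamoDB Streams job");
    }
}
```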

UPDATED ANSWER

It seems that Apache Flink does not use the DynamoDB Streams adapter, so it can read data from Kinesis but not from DynamoDB.

I think one option could be to implement an app that writes data from DynamoDB Streams to Kinesis, and then to read that data from Kinesis in Apache Flink and process it.
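One way to do that forwarding (discussed in the comments below) is a Lambda function triggered by the DynamoDB stream that re-publishes each record to a Kinesis stream. A rough sketch using the AWS SDK for Java v1 and the Lambda events library; the target stream name is an assumption and the record serialization is left simplistic:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;

// Hypothetical forwarder: triggered by the DynamoDB stream, re-publishes
// each change record to a Kinesis stream that FlinkKinesisConsumer can read.
public class DynamoToKinesisForwarder {
    private final AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

    public void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbEvent.DynamodbStreamRecord record : event.getRecords()) {
            // Serialize the stream record; in practice you would emit proper JSON.
            String payload = record.getDynamodb().toString();
            kinesis.putRecord(
                "kinesis_stream_name",  // assumed target stream
                ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8)),
                record.getDynamodb().getKeys().toString());  // partition key
        }
    }
}
```

Note that a single Lambda invocation receives a batch of stream records, which matters for the cost estimates discussed in the comments.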

Another option would be to implement a custom DynamoDB connector for Apache Flink. You can use an existing connector as a starting point.
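A custom connector could be a Flink `SourceFunction` that polls the DynamoDB Streams low-level API directly. A rough, non-production skeleton (single shard, no checkpointing or resharding handling; obtaining the shard iterator via `DescribeStream`/`GetShardIterator` is left out):

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreams;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.model.GetRecordsRequest;
import com.amazonaws.services.dynamodbv2.model.GetRecordsResult;
import com.amazonaws.services.dynamodbv2.model.Record;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class DynamoDBStreamsSource implements SourceFunction<String> {
    private volatile boolean running = true;
    private final String initialShardIterator;  // from DescribeStream + GetShardIterator

    public DynamoDBStreamsSource(String initialShardIterator) {
        this.initialShardIterator = initialShardIterator;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        AmazonDynamoDBStreams streams = AmazonDynamoDBStreamsClientBuilder.defaultClient();
        String iterator = initialShardIterator;
        while (running && iterator != null) {
            GetRecordsResult result = streams.getRecords(
                new GetRecordsRequest().withShardIterator(iterator));
            for (Record record : result.getRecords()) {
                ctx.collect(record.getDynamodb().toString());
            }
            iterator = result.getNextShardIterator();
            Thread.sleep(1000);  // respect GetRecords throughput limits
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```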

You can also take a look at the Apache Spark Kinesis connector, but it seems to have the same issue.

ORIGINAL ANSWER

DynamoDB has a Kinesis adapter that allows you to consume a stream of DynamoDB updates using the Kinesis Client Library. Using the Kinesis adapter is the recommended way (according to AWS) of consuming updates from DynamoDB. It gives you the same data as using the DynamoDB Streams API directly (also called the DynamoDB low-level API).
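Outside of Flink, the adapter approach looks roughly like this with the dynamodb-streams-kinesis-adapter library and KCL v1 (a sketch only; `streamArn`, `credentialsProvider`, and `recordProcessorFactory` are placeholders you would supply, and exact factory method names should be checked against the library version you use):

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.streamsadapter.AmazonDynamoDBStreamsAdapterClient;
import com.amazonaws.services.dynamodbv2.streamsadapter.StreamsWorkerFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

// Wraps the DynamoDB Streams endpoint so the KCL can poll it like Kinesis.
AmazonDynamoDBStreamsAdapterClient adapterClient =
    new AmazonDynamoDBStreamsAdapterClient(AmazonDynamoDBStreamsClientBuilder.defaultClient());

KinesisClientLibConfiguration config = new KinesisClientLibConfiguration(
        "my-app",           // application name (also names the checkpoint table)
        streamArn,          // the DynamoDB stream ARN, not a Kinesis stream name
        credentialsProvider,
        "worker-1")
    .withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);

Worker worker = StreamsWorkerFactory.createDynamoDbStreamsWorker(
    recordProcessorFactory,  // your IRecordProcessorFactory implementation
    config,
    adapterClient,
    AmazonDynamoDBClientBuilder.defaultClient(),
    AmazonCloudWatchClientBuilder.defaultClient());
worker.run();
```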

Ivan Mushketyk
  • Hi Ivan, there is not enough documentation around how to use this adapter. I ran the sample code, but that still does not give me an idea of how to add a source in Flink. Do I have to create a custom source in Flink? The Flink documentation also does not say much. A sample example would be helpful. – AWS Enthusiastic Aug 01 '17 at 15:40
  • I was thinking of writing a lambda function to write the DynamoDB stream to Kinesis. Not sure how good the solution would be from a performance and cost perspective. With more than 500 writes per second, the number of times the lambda function is called will be a big number. Is that a good option? – AWS Enthusiastic Aug 01 '17 at 21:37
  • 500 writes per second is roughly 21M writes per month. If you allocate 128MB for this function and each execution takes 0.2 seconds, you will pay around $4 per month according to this calculation (https://aws.amazon.com/lambda/pricing/) – Ivan Mushketyk Aug 02 '17 at 05:52
  • I am doing something wrong. 500 writes per second is 500 * 60 * 60 * 24 ≈ 43M invocations per day. So the cost is $8 per day, and I have 3 such tables, so it will be $24 per day. It would have been better if the Lambda function were called for a batch rather than every single insert :) – AWS Enthusiastic Aug 02 '17 at 06:30
  • Ok, your calculations are correct, yep, it's $8 per day. But keep in mind that you pay for execution time, not per invocation. That assumes your function runs for 0.2 seconds; if it's faster you will pay less. I don't think it can be executed per batch. – Ivan Mushketyk Aug 02 '17 at 11:04
  • You can write data from DynamoDB streams into Kinesis, and then process it with Spark/Flink. It will probably be more expensive, but at the same time you will be free to do more complicated analysis. – Ivan Mushketyk Aug 02 '17 at 11:06
  • I figured out that the lambda invocations are actually fewer. A single invocation receives a batch which contains multiple records, so ideally the cost will be less. – AWS Enthusiastic Aug 02 '17 at 14:57
  • @Ivan Mushketyk what about this https://issues.apache.org/jira/browse/FLINK-4582? – sri hari kali charan Tummala Aug 07 '19 at 20:18
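As a sanity check on the arithmetic in the comments above, Lambda's $0.20-per-million-requests charge alone already accounts for roughly the $8/day figure (duration charges come on top of that):

```java
public class LambdaCostEstimate {
    public static void main(String[] args) {
        long writesPerSecond = 500;
        // One invocation per write (the worst case; batching reduces this).
        long invocationsPerDay = writesPerSecond * 60L * 60 * 24;
        // Lambda request charge: $0.20 per million invocations.
        double requestCostPerDay = invocationsPerDay / 1_000_000.0 * 0.20;

        System.out.println("Invocations/day: " + invocationsPerDay);      // 43200000
        System.out.printf("Request cost/day: $%.2f%n", requestCostPerDay); // $8.64
    }
}
```

This is why the later comment about batching matters: since one invocation can carry many stream records, the real invocation count, and hence the cost, is much lower.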