
We have a requirement wherein we log an event in a DynamoDB table whenever an ad is served to the end user. There are more than 250 writes per second into this table.

We want to aggregate this data and move it to Redshift for analytics.

The DynamoDB stream will emit a record for every insert made into the table, I suppose. How can I collect the DynamoDB stream into some kind of batches and then process those batches? Are there any best practices around this kind of use case?

I was reading about Apache Spark, and it seems that this kind of aggregation could be done with it. But Spark Streaming does not read DynamoDB streams directly.

Any help or pointers are appreciated.

Thanks

AWS Enthusiastic

2 Answers


DynamoDB Streams has two interfaces: a low-level API and the Kinesis Adapter. Apache Spark has a Kinesis integration, so you can use them together. In case you are wondering which DynamoDB Streams interface you should use, AWS suggests the Kinesis Adapter as the recommended way.

Here is how to use the Kinesis Adapter for DynamoDB.
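
To make this concrete, below is a minimal sketch of an adapter-based consumer, loosely following the AWS walkthrough. It assumes the dynamodb-streams-kinesis-adapter library and the Kinesis Client Library (KCL) 1.x are on the classpath; the application name "ad-event-aggregator", the worker id, and the AdEventProcessor class are illustrative names, so double-check the class and method names against the library versions you actually use.

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.streamsadapter.AmazonDynamoDBStreamsAdapterClient;
import com.amazonaws.services.dynamodbv2.streamsadapter.StreamsWorkerFactory;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessorFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;

public class AdEventStreamConsumer {

    // Called by the KCL with a batch of stream records from one shard; this is where
    // you aggregate the ad events (or hand them to Spark/Flink) before loading Redshift.
    static class AdEventProcessor implements IRecordProcessor {
        @Override
        public void initialize(InitializationInput input) { }

        @Override
        public void processRecords(ProcessRecordsInput input) {
            input.getRecords().forEach(r -> System.out.println(r.getSequenceNumber()));
            try {
                input.getCheckpointer().checkpoint();
            } catch (Exception e) {
                // handle ShutdownException / ThrottlingException properly in real code
            }
        }

        @Override
        public void shutdown(ShutdownInput input) { }
    }

    public static void main(String[] args) {
        // With the adapter, the KCL treats the DynamoDB stream ARN like a Kinesis stream name.
        String streamArn = args[0];

        KinesisClientLibConfiguration workerConfig = new KinesisClientLibConfiguration(
                "ad-event-aggregator", streamArn,
                new DefaultAWSCredentialsProviderChain(), "worker-1")
            .withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);

        AmazonDynamoDBStreamsAdapterClient adapterClient =
            new AmazonDynamoDBStreamsAdapterClient(AmazonDynamoDBStreamsClientBuilder.defaultClient());

        IRecordProcessorFactory factory = AdEventProcessor::new;

        Worker worker = StreamsWorkerFactory.createDynamoDbStreamsWorker(
            factory, workerConfig, adapterClient,
            AmazonDynamoDBClientBuilder.defaultClient(),
            AmazonCloudWatchClientBuilder.defaultClient());

        worker.run();  // runs until shut down, processing new inserts as they arrive
    }
}

Note that such a worker runs continuously (so there is nothing to schedule): you deploy it once, for example on EC2 or EMR, and the KCL takes care of shard discovery, checkpointing and load balancing.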

A few more things to consider:

  • Instead of using Apache Spark, it is worth looking at Apache Flink. It is a stream-first solution (Spark implements streaming using micro-batching), has lower latency, higher throughput, more powerful streaming operators, and support for cyclic (iterative) processing. It also has a Kinesis connector (see the sketch after this list).

  • It can be the case that you don't need DynamoDB Streams to export data to Redshift at all. Redshift's COPY command can load data directly from a DynamoDB table.
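
As a rough illustration of the Flink point, here is a hedged sketch of a Flink job with a Kinesis source and a 5-minute windowed aggregation, of the kind discussed in the comments below. It assumes the ad events are available on a Kinesis-compatible stream named "ad-events" (for example, bridged from the DynamoDB stream by the adapter-based consumer shown above); the stream name, region, and the extractAdId() helper are made up for the example, and the connector package locations may differ between Flink versions.

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

public class AdEventAggregationJob {

    public static void main(String[] args) throws Exception {
        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read raw records from the Kinesis-compatible stream.
        DataStream<String> events = env.addSource(new FlinkKinesisConsumer<>(
                "ad-events", new SimpleStringSchema(), consumerConfig));

        // Count served ads per ad id in tumbling 5-minute windows. The aggregated output
        // can then be written to S3 and COPYed into Redshift, or pushed through JDBC.
        events
            .map(new MapFunction<String, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> map(String record) {
                    return Tuple2.of(extractAdId(record), 1L);
                }
            })
            .keyBy(0)
            .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
            .sum(1)
            .print();   // replace with the sink of your choice

        env.execute("ad-event-aggregation");
    }

    // Stand-in for real parsing of the stream record; extract the ad id from the JSON payload.
    private static String extractAdId(String record) {
        return record;
    }
}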

Ivan Mushketyk
  • Hi Ivan, thanks for the response. My tables are huge, containing more than 150 million rows. That is the reason I do not want to load the entire DynamoDB tables using the COPY command provided by Redshift. I wanted to do an incremental copy from DynamoDB to Redshift and, while doing so, aggregate the data. – AWS Enthusiastic Jul 31 '17 at 10:14
  • Then stream processing is a really viable option. Use Apache Flink/Spark and the Kinesis adapter to do the trick. – Ivan Mushketyk Jul 31 '17 at 10:18
  • I found https://github.com/awslabs/dynamodb-streams-kinesis-adapter to convert the DynamoDB stream to a Kinesis stream. How do I schedule this application? A cron job on EC2? – AWS Enthusiastic Jul 31 '17 at 11:09
  • Why do you need to schedule it? You can run Spark/Flink and they will run constantly, processing incoming items as they arrive. You can do things like aggregate items into 5-minute windows, and Flink will produce elements exactly every 5 minutes and then allow you to write them wherever you want (see the windowed-aggregation sketch in the answer above). – Ivan Mushketyk Jul 31 '17 at 11:13
  • This is why I suggest exploring Flink, since it has more powerful streaming operators. – Ivan Mushketyk Jul 31 '17 at 11:13
  • OK, got it. Sorry for the confusion; I am new to streaming. I will check how to configure this adapter in Flink. – AWS Enthusiastic Jul 31 '17 at 11:29
  • No worries! Feel free to up-vote/accept the answer if you found it useful. – Ivan Mushketyk Jul 31 '17 at 12:24
  • Hi Ivan, I see that you need to use addSource in Apache Flink to consume a Kinesis stream: DataStream kinesis = env.addSource(new FlinkKinesisConsumer<>( "kinesis_stream_name", new SimpleStringSchema(), consumerConfig)); How can I consume a DynamoDB stream? Which class should I be using to consume a DynamoDB stream? – AWS Enthusiastic Jul 31 '17 at 16:18
  • Hey, I think this question goes way beyond the scope of the original question, and Stack Overflow does not approve of this. I think you need to read more about how to structure Apache Flink applications. You can either take a look at this tutorial: http://training.data-artisans.com/ or you can take this course (I am the author :) ): https://www.pluralsight.com/courses/understanding-apache-flink The "Streaming in Apache Flink" module only takes 40 minutes. – Ivan Mushketyk Jul 31 '17 at 20:33
  • Thanks Ivan. I already went through your course. It is very informative and precise. I will research more on how to use the DynamoDB connector. – AWS Enthusiastic Jul 31 '17 at 22:09

Amazon EMR provides an implementation of the Hadoop DynamoDB connector as part of emr-ddb-hadoop.jar, which contains the DynamoDBItemWritable class. Using this connector, you can implement your own DynamoDBInputFormat as shown below.

import static java.util.Objects.requireNonNull;
import java.io.IOException;
import java.io.Serializable;
import java.util.stream.IntStream;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import lombok.NonNull;

public class DynamoDbInputFormat implements InputFormat, Serializable {

    // Configuration key for the number of parallel scan segments (key name chosen here for illustration).
    private static final String NUMBER_OF_SPLITS = "dynamodb.numberOfSplits";

    @Override
    public InputSplit[] getSplits(@NonNull final JobConf job, final int numSplits) throws IOException {
        final int splits = Integer.parseInt(requireNonNull(job.get(NUMBER_OF_SPLITS),
                NUMBER_OF_SPLITS + " must be non-null"));

        // One logical split per DynamoDB scan segment; DynamoDbSplit and getRecordReader(...) still need to be implemented.
        return IntStream.range(0, splits)
                .mapToObj(segmentNumber -> new DynamoDbSplit(segmentNumber, splits))
                .toArray(InputSplit[]::new);
    }
}
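
For completeness, here is a hedged sketch of how an input format like this could be plugged into a classic mapred job. It assumes the record reader and the DynamoDbSplit class are implemented as well; the configuration key, the job name, and the use of IdentityMapper are illustrative choices, not part of the EMR connector.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class DynamoDbExportJob {

    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(DynamoDbExportJob.class);
        job.setJobName("dynamodb-export");

        // Number of parallel scan segments the input format should create.
        job.set("dynamodb.numberOfSplits", "8");
        job.setInputFormat(DynamoDbInputFormat.class);

        // IdentityMapper just passes records through; replace it with your aggregation logic.
        job.setMapperClass(IdentityMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        JobClient.runJob(job);
    }
}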
  • The author of the question indicated that he needs to do stream processing. You've provided an answer for how to use MapReduce with DynamoDB. I don't see how MapReduce can perform stream processing. For this you need to use Spark/Flink (see my answer). – Ivan Mushketyk Jul 31 '17 at 10:20
  • Also, instead of using MapReduce you can use EMR Hive directly, which can use the DynamoDB adapter: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.Tutorial.html and allows you to run SQL queries instead of writing MapReduce code. – Ivan Mushketyk Jul 31 '17 at 10:22