I have problems implementing dynamodbstreams. We want to get records of changes right at the time the dynamodb table is changed.

We used the Java example from https://docs.aws.amazon.com/en_en/amazondynamodb/latest/developerguide/Streams.LowLevel.Walkthrough.html and translated it for our C++ project. Instead of ShardIteratorType.TRIM_HORIZON we use ShardIteratorType.LATEST. I am currently testing with an existing table and do not know how many records to expect.

Most of the time, when I iterate over the shards returned for an Aws::DynamoDBStreams::Model::DescribeStreamRequest sent through Aws::DynamoDBStreams::DynamoDBStreamsClient, I do not see any records. For testing, I change entries in the DynamoDB table through the AWS console. But sometimes (and I do not know why) there are records and it works as expected.

I am sure that I misunderstand the concept of streams and especially of shards and records. My thinking is that I need to find a way to find the most recent shard and to find the most recent data in that shard.

Isn't this what ShardIteratorType.LATEST would do? How can I find the most recent data in my stream?

I appreciate all of your thoughts and am curious to see what happens to my first Stack Overflow post ever.

Best David

David

1 Answer

How can I find the most recent data in my stream?

How would you define the most recent data? Last 10 entries? Last entry? Or data that is not yet in the shard? The question may sound silly but the answer makes a difference.

The option you are using, LATEST, places the iterator right after the last entry in the shard. Unless new data arrives after the iterator has been created, there will be nothing to read.

If by the most recent data you mean some records that are already in the shard then you can't use LATEST. The easy option is to use TRIM_HORIZON.
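
To make the difference between the two iterator types concrete, here is a minimal in-memory sketch. It is a toy model, not the AWS SDK: `ToyShard`, `ToyIteratorType`, `ToyIterator`, `makeIterator`, and `readRecords` are hypothetical names invented for illustration, and the model assumes a single shard whose records are simply appended in order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for one shard. Hypothetical type, not part of the AWS SDK.
struct ToyShard {
    std::vector<std::string> records;  // oldest record first
};

// Mirrors the two iterator types discussed above (names are assumptions).
enum class ToyIteratorType { TrimHorizon, Latest };

// Index of the next record this reader will see.
struct ToyIterator {
    std::size_t next;
};

// TrimHorizon starts at the oldest record; Latest starts just past the newest.
ToyIterator makeIterator(const ToyShard& shard, ToyIteratorType type) {
    if (type == ToyIteratorType::TrimHorizon) {
        return ToyIterator{0};
    }
    return ToyIterator{shard.records.size()};
}

// Returns everything from the iterator position to the end of the shard
// and advances the iterator past what was returned.
std::vector<std::string> readRecords(const ToyShard& shard, ToyIterator& it) {
    std::vector<std::string> out;
    for (; it.next < shard.records.size(); ++it.next) {
        out.push_back(shard.records[it.next]);
    }
    return out;
}
```

In this model, a `Latest` iterator returns nothing on its first read no matter how many records the shard already holds, and only returns records appended after it was created. With the real service the same idea applies across calls: as in the Java walkthrough, each read hands back a next shard iterator, and that one has to be used for the following read rather than requesting a fresh LATEST iterator every time.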

Even easier would be to subscribe a Lambda function to the stream. It is invoked automatically whenever a new record is put into the stream, with the record passed to the function as its payload, which may be preferable if you need to handle events in near-real time.

Matus Dubrava
  • Thanks @Matus, that clarified some things! By most recent data I mean data that is not in the shard yet. I have now sorted the shards by their StartingSequenceNumber. Using LATEST, I only iterate over the last (most recent?) shard. Now I get records every time I change the first two items in the table. Changing the third (last) entry, creating new entries, or deleting entries does not produce any records. The stream's StreamViewType is set to NEW_AND_OLD_IMAGES. Is it possible that they are written to another shard? And: we'd have to use one Lambda per table, right? Thanks again! David – David Jul 15 '20 at 10:36
  • That is weird behavior, but it may have something to do with the distributed nature of the shards. I believe that shards are internally replicated across multiple servers and the iterator might be polling only a subset of them (I know for a fact that when using an SQS queue with short polling, your HTTP request polls only a subset of the servers, possibly resulting in an empty list while there actually are entries in the queue). Whether this is the case with shard iterators or not, I don't know. – Matus Dubrava Jul 15 '20 at 12:24
  • Also, there is this line in the docs for the shard iterator: `Note that it might take multiple calls to get to a portion of the shard that contains stream records`. I don't know exactly what they mean by that, but it might be related. Again, note that iterators are mostly used for batch processing when you don't care about fetching the latest records. As for whether the records can be sent to a different shard: they can, but not to parent ones. Previous shards are closed and read-only, but the DynamoDB service may spawn new shards based on usage. – Matus Dubrava Jul 15 '20 at 12:30
  • You do not need one lambda function per table but you will need one trigger per table which can be multiple instances of the same lambda function, assuming that you can handle streams from multiple tables with the same kind of code. If you can, you can freely reuse the same function. Last thing, I would definitely use the lambda function if I cared only about the latest entry. Also, you could set up the trigger just to observe whether the stream is working as expected. Every event should trigger the lambda automatically without worrying about the above mentioned stuff. – Matus Dubrava Jul 15 '20 at 12:35
  • Thanks again @Matus! Hopefully my last question: do I have to use another service like a kinesis stream to trigger code in my desktop application with a lambda function? dynamodb would trigger a lambda which would write to a stream – is there a more straightforward way? – David Jul 16 '20 at 20:15
  • I am sorry, I don't understand the question (the described flow of events). Could you be more explicit about what you are trying to achieve? – Matus Dubrava Jul 16 '20 at 21:11
  • Sorry for not being clear about that! I need a feedback/event on selected devices when the dynamodb table was changed. My thinking was: I can somehow "subscribe" to a lambda to see if it was triggered by the dynamodb table and read the changes made to the table. But I do not see any entry points in the lambda classes to achieve that. – David Jul 17 '20 at 06:15
  • Maybe the following use case clarifies what we are trying to achieve: (1) a visually identical app runs on Device A and Device B; (2) a toggle is turned on on Device A; (3) the toggle state (bool) gets written to DynamoDB; (4) that change is picked up by Device B and the toggle turns on there as well. – David Jul 17 '20 at 07:55
  • These services communicate over HTTP/S, which is a stateless protocol. To achieve that, you will need some kind of persistent connection, such as websockets, so that the server can initiate communication back to the client. It is not going to be as simple as using a Lambda trigger or iterator short polling. You will need to add some additional components to the setup. There are many different architectures that can be used, but you might start by looking into the `appsync` service with websockets. – Matus Dubrava Jul 17 '20 at 09:22
  • So am I understanding this correctly, that there is no SDK-only option to handle this? Theoretically we have all the tools to do https based things at our disposal (we're writing in Juce/C++, so a huge framework containing everything necessary)... but we would prefer to stay within the bounds of the AWS C++ SDK when using AWS. Is this not possible at all? – David Jul 17 '20 at 16:32
  • No, I am just saying that you will need websockets instead of HTTPS unless you want to spam your system with HTTP requests. Ideally you want a push-based notification system instead of a poll-based one, especially if you want near-real-time performance, which rules out some kind of long polling. There is definitely a way to do this with the AWS SDK, but you will need more than just the DynamoDB client and Lambda. I believe that the `appsync` service plus its SDK client will allow you to do this. https://docs.aws.amazon.com/appsync/latest/devguide/real-time-websocket-client.html – Matus Dubrava Jul 17 '20 at 18:36
  • thanks so much for your helpful advice! We will have a look at appsync – I will update or answer my question as soon as we've managed to implement appsync with the c++ sdk. – David Jul 20 '20 at 07:11
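
The comments above mention sorting shards by StartingSequenceNumber and note that closed parent shards are read-only while new shards may be spawned. The sketch below shows one way to pick out the shards that can still receive records. It rests on several assumptions: `ShardInfo`, `sequenceLess`, and `openShards` are hypothetical names, the struct is a plain stand-in for the shard description returned by DescribeStream, a shard is treated as open when it has no EndingSequenceNumber, and sequence numbers are taken to be decimal strings without leading zeros that are too long for a 64-bit integer. It also assumes that more than one shard may be open at the same time, which would be one possible reason why reading only the single "last" shard misses changes to some items.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Plain stand-in for a shard description. Hypothetical type, not the SDK's.
struct ShardInfo {
    std::string shardId;
    std::string startingSequenceNumber;
    std::string endingSequenceNumber;  // assumed empty while the shard is open
};

// Numeric "less than" for decimal strings without leading zeros:
// a shorter string is the smaller number, equal lengths compare lexically.
bool sequenceLess(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) {
        return a.size() < b.size();
    }
    return a < b;
}

// Returns all shards that are still open, oldest starting sequence first.
std::vector<ShardInfo> openShards(const std::vector<ShardInfo>& shards) {
    std::vector<ShardInfo> open;
    std::copy_if(shards.begin(), shards.end(), std::back_inserter(open),
                 [](const ShardInfo& s) { return s.endingSequenceNumber.empty(); });
    std::sort(open.begin(), open.end(),
              [](const ShardInfo& a, const ShardInfo& b) {
                  return sequenceLess(a.startingSequenceNumber,
                                      b.startingSequenceNumber);
              });
    return open;
}
```

Under these assumptions, a poller would keep one iterator per shard returned by `openShards` instead of one iterator for the single most recent shard, and would re-describe the stream from time to time to notice newly spawned shards.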