
I intend to use DynamoDB Streams to implement a log trail that tracks changes to a number of tables (and writes this to log files on S3). Whenever a modification is made to a table, a Lambda function is invoked from the stream event. I also need to record the user who made the modification. For put and update, I can solve this by including an actual table attribute holding the ID of the caller. The record stored in the table will then include this ID, which isn't really desirable (it's metadata about the operation rather than part of the record itself), but I can live with that.

So for example:

put({
  TableName: 'fruits',
  Item: {
    id: 7,
    name: 'Apple',
    flavor: 'Delicious',
    __modifiedBy: 'USER_42'
  }
})

This will result in a Lambda function invocation, where I can write something like the following to my S3 log file:

{
  table: 'fruits',
  operation: 'put',
  time: '2018-12-10T13:35:00Z',
  user: 'USER_42',
  data: {
    id: 7,
    name: 'Apple',
    flavor: 'Delicious'
  }
}
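
Roughly, the stream handler I have in mind would look something like this (the bucket name and object key scheme are just placeholders; AWS.DynamoDB.Converter.unmarshall turns the DynamoDB-JSON stream image into a plain object):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  for (const record of event.Records) {
    // Stream images arrive in DynamoDB JSON; unmarshall() converts them to plain objects
    const image = AWS.DynamoDB.Converter.unmarshall(
      record.dynamodb.NewImage || record.dynamodb.OldImage
    );
    const { __modifiedBy, ...data } = image;

    const entry = {
      table: 'fruits',                  // could also be derived from record.eventSourceARN
      operation: record.eventName,      // INSERT | MODIFY | REMOVE
      time: new Date(record.dynamodb.ApproximateCreationDateTime * 1000).toISOString(),
      user: __modifiedBy,
      data
    };

    await s3.putObject({
      Bucket: 'my-log-bucket',          // placeholder bucket name
      Key: 'logs/' + record.eventID + '.json',
      Body: JSON.stringify(entry)
    }).promise();
  }
};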

However, for deletes a problem arises: how can I log the calling user of the delete operation? Of course I could make two requests, one that updates __modifiedBy and another that deletes the item, and the stream handler would then fetch the __modifiedBy value from the OLD_IMAGE included in the stream event. However, this is really undesirable, since it spends two writes on a single delete of an item.
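
Just to illustrate, that two-request workaround would look something like this (deleteWithAuthor is only a hypothetical helper name):

const AWS = require('aws-sdk');
const db = new AWS.DynamoDB.DocumentClient();

// Hypothetical helper: stamp the caller first, then delete, so the stream's
// REMOVE event carries __modifiedBy in its OLD_IMAGE.
async function deleteWithAuthor(id, userId) {
  // 1st write: record who is about to delete the item
  await db.update({
    TableName: 'fruits',
    Key: { id },
    UpdateExpression: 'SET #m = :u',
    ExpressionAttributeNames: { '#m': '__modifiedBy' },
    ExpressionAttributeValues: { ':u': userId }
  }).promise();

  // 2nd write: the actual delete
  await db.delete({ TableName: 'fruits', Key: { id } }).promise();
}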

So is there a better way, such as attaching metadata to DynamoDB operations that is carried over into the stream events without being part of the data written to the table itself?

JHH

1 Answer


Here are three different options; the right one will depend on the requirements of your application. It may be that none of them fits your specific use case, but in general these approaches all work.

Option 1

If you’re using AWS IAM at a granular enough level, then you can get the user identity from the Stream Record.
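
If the identity does show up on your stream records (see the comments below about when DynamoDB actually populates it), reading it in the stream handler is straightforward. A minimal sketch:

exports.handler = async (event) => {
  for (const record of event.Records) {
    const identity = record.userIdentity;   // may be absent on many records
    console.log({
      operation: record.eventName,          // INSERT | MODIFY | REMOVE
      user: identity ? identity.principalId : 'unknown'
    });
  }
};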

Option 2

If you can handle a small overhead when writing to DynamoDB, you could set up a Lambda function (or EC2-based service) that acts as a write proxy for your DynamoDB tables. Configure your permissions so that only that Lambda can write to the table; it can then accept any metadata you want and log it however you want. If all you need is logging of events, you don't even need to write to S3, since AWS handles Lambda logs for you (in CloudWatch Logs).

Here's some example pseudocode for a Lambda function that uses logging instead of writing to S3:

handle_event(operation, item, user)
    log(operation, item, user)
    switch operation
        case put:
             dynamodb.put(item)
        case update:
             dynamodb.update(item)
        case delete:
             dynamodb.delete(item)

log(operation, item, user)
    logEntry.time = now
    logEntry.user = user
    ...
    print(logEntry)

You are, of course, free to still log directly to S3, but if you do, you may find that the added latency is significant enough to impact your application.
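
For reference, a rough Node.js version of the pseudocode above might look like the following; the event shape (operation, table, item, key, user, and the update fields) is just an assumption about how callers would invoke the proxy:

const AWS = require('aws-sdk');
const db = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  const { operation, table, item, key, user } = event;   // assumed invocation shape

  // Log first; CloudWatch Logs captures console output from Lambda automatically
  console.log(JSON.stringify({
    table,
    operation,
    time: new Date().toISOString(),
    user,
    data: item || key
  }));

  switch (operation) {
    case 'put':
      return db.put({ TableName: table, Item: item }).promise();
    case 'update':
      return db.update({
        TableName: table,
        Key: key,
        UpdateExpression: event.updateExpression,           // assumed field
        ExpressionAttributeValues: event.expressionValues   // assumed field
      }).promise();
    case 'delete':
      return db.delete({ TableName: table, Key: key }).promise();
    default:
      throw new Error('Unsupported operation: ' + operation);
  }
};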

Option 3

If you can tolerate some stale data in your table, set up DynamoDB TTL on your table(s). Don't set a TTL value when creating or updating an item. Then, instead of deleting an item, update it by setting the TTL field to the current time. As far as I can tell, DynamoDB does not consume write capacity when removing items with an expired TTL, and expired items are typically removed within 48 hours of expiry.

This will allow you to log the "add TTL" update as the deletion and record a last-modified-by user for that deletion. You can safely ignore the actual delete that occurs later when DynamoDB cleans up the expired items.

In your application, you can also check for the presence of a TTL value so that you don't accidentally present users with deleted data. You could also add a filter expression to your queries to omit items that have a TTL set.
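
A minimal sketch of that approach, assuming the table's TTL attribute is named ttl and the same DocumentClient style as in the question:

const AWS = require('aws-sdk');
const db = new AWS.DynamoDB.DocumentClient();

// "Soft delete": instead of delete(), set the TTL (epoch seconds) and record the caller.
async function softDelete(id, userId) {
  await db.update({
    TableName: 'fruits',
    Key: { id },
    UpdateExpression: 'SET #ttl = :now, #m = :u',
    ExpressionAttributeNames: { '#ttl': 'ttl', '#m': '__modifiedBy' },
    ExpressionAttributeValues: { ':now': Math.floor(Date.now() / 1000), ':u': userId }
  }).promise();
}

// Reads can then filter out items that are already marked as deleted:
const params = {
  TableName: 'fruits',
  FilterExpression: 'attribute_not_exists(#ttl)',
  ExpressionAttributeNames: { '#ttl': 'ttl' }
};
// e.g. db.scan(params).promise(), or add a KeyConditionExpression and use db.query(...)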

Matthew Pope
  • These are all good suggestions, thanks. I had thought about option 1, but I'm not using that kind of IAM setup, so I need a custom user ID from my application layer to be logged as the "author" of all operations. Option 3 is something I didn't consider at all; that's actually very nice. However, I would have to change quite a lot of my other queries to exclude TTL'd items from the result sets, which will probably cause me some headache. If I'd been starting from scratch I probably would've used this solution, though. – JHH Dec 11 '18 at 12:30
  • That leaves option 2, and I'll probably end up doing a variant of that, but instead of an API layer running in the cloud that wraps the DB calls, I'll probably do it on the client side. I actually already have a layer/DB wrapper around all my database calls, and instead of being triggered from Dynamo I could submit the event from there directly. While I would have preferred to decouple my log trail from application logic, this will probably be the easiest in the end. – JHH Dec 11 '18 at 12:32
  • The downside is that any potential "direct" DynamoDB calls will sneak through without being logged, but that could also be the case with streams and Lambda functions when using a custom format for all operations (including modifiedTime, modifiedBy, a custom delete scheme using TTL, etc.), so maybe it's not really an issue. – JHH Dec 11 '18 at 12:33
  • If you use option 2 with a Lambda function as a proxy, you can actually prevent any "sneaky" calls from bypassing the proxy by setting an IAM policy that blocks all DynamoDB puts, updates, and deletes that don't come from that Lambda function. Option 3 could also be implemented with a proxy, and that might help to limit the number of places where you need to change your code. – Matthew Pope Dec 11 '18 at 13:27
  • I don't think option 1 is applicable: the userIdentity field is only populated when records are deleted via the TTL feature. In all other cases the userIdentity field isn't available in the record. – Parv Sharma May 08 '20 at 07:38