6

I have a DynamoDB table that stores historical run data for processes that run on my server, and I need a way to aggregate these processes and see the data for the latest run of each one. Each process has its own ProcessId, which is the partition key for the DynamoDB table. The sort key is the StartDateTime:

{
  ProcessId, // Partition Key
  StartDateTime, // Sort Key
  ... // More data
}

Essentially I need to retrieve the most recent StartDateTime for each ProcessId that I give. I'm using a Node.js Lambda with the aws-sdk to retrieve the data. I've looked into using BatchGetItem, but my understanding is that for tables with a partition key and sort key, you need to provide both to retrieve an item. I've also looked into using a Query, but I would need to run a separate query for each partition, which is less than ideal. Does anyone know of a way I can make this request in one call rather than having to make a separate call per partition?

Luke
  • Not aware that you can do this in a single API call. I expect you'll have to retrieve the unique ProcessId values, then issue a single query for each, with `Limit=1` and `ScanIndexForward=false` to sort by descending StartDateTime and retrieve the latest. – jarmod Jan 10 '20 at 01:00
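The per-process approach jarmod describes can be sketched as below. Table and attribute names are taken from the question; `queryFn` is injected (in a Lambda it would be `(p) => new AWS.DynamoDB.DocumentClient().query(p).promise()`) so the logic can be exercised without AWS credentials.

```javascript
const TABLE_NAME = 'ProcessRuns'; // hypothetical table name

// Query params for one process: newest StartDateTime first, one item only.
function buildLatestRunQuery(processId) {
  return {
    TableName: TABLE_NAME,
    KeyConditionExpression: 'ProcessId = :pid',
    ExpressionAttributeValues: { ':pid': processId },
    ScanIndexForward: false, // descending sort-key order
    Limit: 1,                // only the latest run per process
  };
}

// One Query per process, issued in parallel so total latency stays
// close to a single round trip even though it is N API calls.
async function getLatestRuns(processIds, queryFn) {
  const results = await Promise.all(
    processIds.map((pid) => queryFn(buildLatestRunQuery(pid)))
  );
  return results.map((r) => r.Items[0]);
}
```

This is still one Query per partition, but the parallel fan-out keeps it practical for modest numbers of processes.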

2 Answers

6

To sum up what I understood from your post, you may have data like this in your table:

PK (id)         SK (timestamp)    Other data
process1        1                 ...
process2        4                 ...
process1        8                 ...
process3        18                ...
process2        25                ...

Your need is to easily retrieve:

process1        8                 ...
process2        25                ...
process3        18                ...

As sandboxbohemian said, I suggest using a DynamoDB stream to trigger a Lambda function each time a new item arrives. However, I would use the same table and upsert an item with the same id and a timestamp equal to 0. In addition, I would add a string attribute "latest", always set to "true", and a number attribute "timestamp2" holding the run's actual timestamp. Chronologically the entries would be:

PK (id)         SK (timestamp)    Other data      timestamp2 (GSI SK)  latest (GSI PK)
process1        1                 ...
process1        0                 ...             1                    true
process2        4                 ...
process2        0                 ...             4                    true
process1        8                 ...
process1        0                 ...             8                    true
process3        18                ...
process3        0                 ...             18                   true
process2        25                ...
process2        0                 ...             25                   true
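A minimal sketch of the stream-triggered upsert, assuming the schema above (id / timestamp keys, plus timestamp2 and latest for the GSI). The table name is hypothetical, and `putFn` is injected (in a Lambda it would be `(p) => ddb.put(p).promise()` with a DocumentClient):

```javascript
// Build the marker item for one stream record. newImage is in
// DynamoDB JSON form ({ S: ... } / { N: ... } wrappers).
function buildLatestUpsert(newImage) {
  return {
    TableName: 'ProcessRuns', // assumed table name
    Item: {
      id: newImage.id.S,                        // same partition key
      timestamp: 0,                             // fixed SK for the marker item
      latest: 'true',                           // sparse GSI partition key
      timestamp2: Number(newImage.timestamp.N), // GSI sort key: real run time
    },
  };
}

async function handleStream(event, putFn) {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT') continue;
    const img = record.dynamodb.NewImage;
    // Skip the marker items we write ourselves, or the stream would loop.
    if (Number(img.timestamp.N) === 0) continue;
    await putFn(buildLatestUpsert(img));
  }
}
```

Any "Other data" attributes you want visible in the index would be copied into the marker item the same way.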

Then you have to create a GSI with "latest" as its partition key and "timestamp2" as its sort key, projecting the "id" and data attributes. It will be a sparse index, meaning that only items with the "latest" attribute filled in will appear in it. Here is its content:

latest (GSI PK) timestamp2 (GSI SK)   id        timestamp   Data
true            8                     process1  0           ...
true            25                    process2  0           ...    
true            18                    process3  0           ...   

As you can see, the GSI partition key always has the same value, so it allows either a Query or a Scan. If you need the latest run of every process, you can Scan the index. If the number of processes is really high, you can Query with latest = "true" and take advantage of the sorting on timestamp2.
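The single Query against the sparse index can be sketched like this (the index name is an assumption; use whatever name you gave the GSI, and pass the params to `DocumentClient.query()`):

```javascript
// One Query returns every "latest" marker item, newest run first.
function buildLatestIndexQuery() {
  return {
    TableName: 'ProcessRuns',             // assumed table name
    IndexName: 'latest-timestamp2-index', // assumed GSI name
    KeyConditionExpression: 'latest = :t',
    ExpressionAttributeValues: { ':t': 'true' },
    ScanIndexForward: false, // most recently started processes first
  };
}
```

One item per process comes back in a single call, which is exactly what the question asked for.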

I agree this schema is not intuitive, but that is often the case with DynamoDB.

ben11
  • This is a great answer! Keep contributing Ben & get that rep up! – Derrops Apr 20 '20 at 05:27
  • This maybe ok for a small dataset, but it's often advised to use high cardinality values for PK: https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/. Here's a great Re:Invent session as well on modelling DynamoDB data: https://www.youtube.com/watch?v=HaEPXoXVf2k – peter n Jun 26 '20 at 03:18
  • Great way of doing it. Would you advise to always use a trigger, or can you do it in the lambda function that does the original insert? – xtra Aug 03 '21 at 18:53
1

You appear to be attempting a sort of aggregation, and DynamoDB is typically not best suited for aggregations; it is better suited to CRUD-style operations.

Instead of running expensive queries or scans, try enabling DynamoDB Streams on the table and using another Lambda function to 'upsert' the latest start time into a second DynamoDB table with the processId as the partition key.

Then you can query this new table for the latest start time using the processId alone.
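With a second table keyed on processId alone, BatchGetItem needs only the partition key, which satisfies the original one-call goal. A sketch, assuming a hypothetical aggregate table name (pass the params to `DocumentClient.batchGet()`; note BatchGetItem is capped at 100 keys per request):

```javascript
// Fetch the latest-run row for each given processId in one request.
function buildLatestBatchGet(processIds) {
  return {
    RequestItems: {
      LatestProcessRuns: { // assumed aggregate table name
        Keys: processIds.map((pid) => ({ ProcessId: pid })),
      },
    },
  };
}
```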