DynamoDB date GSI

Question

I have a DynamoDB table that stores executions of some programs, this is what it looks like:

Partition Key	Sort Key	StartDate	...
program-name	execution-id (uuid)	YYYY-MM-DD HH:mm:ss	...

I have two query scenarios for this table:

Query by program name and execution id (easy)
Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59

What is the correct way to perform the second query?

I understand I need to create a GSI to do that, but how should this GSI look like?

I was thinking about splitting the StartDate attribute into two, like this:

Partition Key	Sort Key	StartMonthYear	StartDayTime	...
program-name	execution-id (uuid)	YYYY-MM	DD HH:mm:ss	...

So I can define a GSI using the StartMonthYear as the partition key and the StartDayTime as the sort key.

The only problem with this approach is that I would have to write some extra logic in my application to identify all the partitions I would need to query in the requested range. For example:

If the range is: 2021-05-15 00:00:00 to 2021-07-15 23:59:59

I would need to query 2021-05, 2021-06 and 2021-07 partitions with the respective day/time restrictions (only the first and last partition is this example).

Is this the correct way of doing this or am I totally wrong?

How many different programs are there? Is there also an access pattern that asks for the executions of a certain program in a time range? — Maurice, Jul 29 '21 at 13:41
There are like 500 programs and there is not access pattern like that. Time range will always include all programs. — Pedro Henrique, Jul 29 '21 at 13:50

score 1 · Accepted Answer · answered Jul 29 '21 at 14:08

If you quickly want to fetch all executions in a certain time-frame no matter the program, there are a few ways to approach this. The easiest solution would be a setup like this:

PK	SK	GSI1PK	GSI1SK	StartDate
`PROG#<name>`	`EXEC#<uuid>`	`ALL_EXECUTIONS`	`S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid>`	`yyyy-mm-ddThh:mm:ss`

PK is the partition key for the base table
SK is the sort key for the base table
GSI1PK is the partition key for the global secondary index GSI1
GSI1SK is the sort key for the global secondary index GSI1

Query by program name and execution id (easy)

Still easy, do a GetItem based on the program name for <name> and uuid for <uuid>.

Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59

Do a Query on GSI1 with the KeyConditionExpression: PK = ALL_EXECUTIONS AND SK >= 'S#2021-05-15 00:00:00' AND SK <= 'S#2021-07-15 23:59:59'. This would return all the executions in the given time range.

But: You'll also build a hot partition, since you effectively write all your data in a single partition in GSI1.

To avoid that, we can partition the data a bit and the partitioning depends on the number of executions you're dealing with. You can choose years, months, days, hours, minutes or seconds.

Instead of GSI1PK just being ALL_EXECUTIONS, we can set it to a subset of the StartDate.

PK	SK	GSI1PK	GSI1SK	StartDate
`PROG#<name>`	`EXEC#<uuid>`	`EXCTS#<yyyy-mm>`	`S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid>`	`yyyy-mm-ddThh:mm:ss`

In this case you'd have a monthly partition, i.e.: all executions per month are grouped. Now you would have to make multiple queries to DynamoDB and later join the results.

For the query range from 2021-05-15 00:00:00 to 2021-07-15 23:59:59 you'd have to do these queries on GSI1:

@GSI1: GSI1PK=EXCTS#2021-05 AND GSI1SK >= S#2021-05-15 00:00:00
@GSI1: GSI1PK=EXCTS#2021-06
@GSI1: GSI1PK=EXCTS#2021-07 AND GSI1SK <= S#2021-07-15 23:59:59

You can even parallelize these and later join the results together.

Again: Your partitioning scheme depends on the number of executions you have in a day and also which maximum query ranges you want to support.

This is a long-winded way of saying that your approach is correct in principle, but you can choose to tune it based on your use case.

How bad is it to have that hot partition? The query that will use that GSI will run only once per day. Is it as bad as performing a scan first and then filtering based on the range? — Pedro Henrique, Jul 29 '21 at 14:26
It will be cheaper than the scan, because DynamoDB won't have to examine all items. If the query is running once per day it's probably fine, you should have enough burst capacity stored up for that. All that depends on the amount of records you expect to deal with. If we're talking dozens per day you'll be fine for a much longer time than if it would me thousands. — Maurice, Jul 29 '21 at 14:33

DynamoDB date GSI

1 Answers1