2

We have a Dynamodb table Events with about 50 million records that look like this:

{
 "id": "1yp3Or0KrPUBIC",
 "event_time": 1632934672534,
 "attr1" : 1,
 "attr2" : 2,
 "attr3" : 3,
 ...
 "attrN" : N,
}

The Partition Key=id and there is no Sort Key. There can be a variable number of attributes other than id (globally unique) and event_time, which are required.

This setup works fine for fetching by id but now we'd like to efficiently query against event_time and pull ALL attributes for records that match within that range (could be a million or two items). The criteria would be equal to something like WHERE event_date between 1632934671000 and 1632934672000, for example.

Without changing any existing data or transforming it through an external process, is it possible to create a Global Secondary Index using event_date and projecting ALL attributes that could allow a range query? By my understanding of DynamoDB this isn't possible but maybe there's another configuration I'm overlooking.

Thanks in advance.

J-Deq87
  • 101
  • 10

2 Answers2

5

(Edit: I rewrote the answer because the OP's comment clarified that the requirement is to query event_time ranges ignoring id. OP knows the table design is not ideal and is trying to make the best of a bad situation).

Is it possible to create a Global Secondary Index using event_date and projecting ALL attributes that could allow a range query?

Yes. You can add a Global Secondary Index to an existing table and choose which attributes to project. You cannot add an LSI to an existing table or change the table's primary key.

Without changing any existing data or transforming it through an external process?

No. You will need to manipulate the attibutes. Although arbitrary range queries are not its strength, DynamoDB has a time series pattern that can be adapted to your query pattern.

Let's say you query mostly by a limitied number of days. You would add a GSI with yyyy-mm-dd PK (Partition Key). Rows are made unique by a SK (Sort Key) that concatenates the timestamp with the id: event_time#id. PK and SK together are the Index's Composite Primary Key.

GSIPK1 = yyyy-mm-dd # 2022-01-20
GSISK1 = event_time#id # 1642709874551#1yp3Or0KrPUBIC

Querying for a single day needs 1 query operation, for a calendar week range needs 7 operations.

GSI1PK = "2022-01-20" AND GSI1SK > ""

Query a range within a day by adding a SK between condition:

GSI1PK = "2022-01-20" AND GSI1SK BETWEEN "1642709874" AND "16427098745"
fedonev
  • 20,327
  • 2
  • 25
  • 34
  • With `GSIPK1 = id, GSISK1 = event_time` don't you still have to first query by the `id` then sort by `event_time`? Basically I don't know or care about the `id` at all, but I thought with Dynamodb you could only query the sort key _after_ first restricting by the partition key `id`? – J-Deq87 Jan 20 '22 at 21:06
  • Yes, you are correct. It was not clear to me that `id` is not part of any query (because why would it have been chosen as the PK to begin with?). The challenge is to find a key definition that preserves row uniqueness (which is also a requirement for GSIs). `event_time` is not unique on its own. – fedonev Jan 20 '22 at 21:20
  • If you make event_time the GSI1PK at millisecond resolution then to pull the data for a 1 second time period you'll need to do 1,000 (parallel) query calls to fetch all the items from each millisecond. If you could have event_second as the PK and event_time as the SK then you'd have one query per second of time period. – hunterhacker Jan 20 '22 at 21:36
  • @Justin-Deq I edited the answer to deal with the requirement to ignore `id`. – fedonev Jan 20 '22 at 21:49
  • 1
    @fedonev - Thank you. I will review and test things out. I apologize for not being clear about `id` being worthless. The design was before my time and they didn't know `event_time` was _actually_ more important in the long run! Stay tuned... – J-Deq87 Jan 21 '22 at 21:07
  • @fedonev The AWS docs say the PK should be unique: `primary key must be unique for each item in the table. Failure to choose an appropriate primary key can lead to uneven data distribution and hot keys, which might cause throttling`. Wouldn't your solution violate that since I might have thousands of records against a single day like `"2022-01-20"`? – J-Deq87 Jan 28 '22 at 16:06
  • @Justin-Deq No violation. `PK` is short for _Partition Key_ in [DynamoDB-speak](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-modeling-nosql-B.html). The answer's GSI key is a [Composite Primary Key](https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/) of `PK` and `SK` together. – fedonev Jan 28 '22 at 17:35
1

It seems like one can create a global secondary index at any point.

Below is an excerpt from the Managing Global Secondary Indexes documentation which can be found here https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html

To add a global secondary index to an existing table, use the UpdateTable operation with the GlobalSecondaryIndexUpdates parameter.

jddsm
  • 53
  • 7