
There exists a data set as described in the table below. sr.no is used in the table only for reference.

|sr.no|    id    | tis |data-type|   b.id   |idType_2|  var_2 |
|-----|----------|-----|---------|----------|--------|--------|
|  1  |abc-def-gi|12345|  a-type |1234567890| 843023 |  NULL  |
|  2  |1234567890|12346|  b-type |   NULL   |  NULL  |40030230|
|  3  |abc-def-gj|12347|  a-type |1234567890| 843023 |  NULL  |

Query types

  1. Input id; if data-type is a-type, return fields tis, b.id, idType_2 (reference sr.no=1)
  2. Input id; if data-type is b-type, return field var_2 (reference sr.no=2)
  3. Input idType_2; return fields id, tis, b.id (reference sr.no=1,3)
  4. Input data-type; return id where tis is between 12345 and 12347

Note

  • sr.no=1,3 (a-type) data is inserted ~100k times a day, each with a unique id.
  • sr.no=2 (b-type) data is a fixed set of data.

Is the below key approach efficient for a dataset like this? Is there any other approach that can be followed to store and retrieve data from DynamoDB?

Partition Key = id to take care of Queries 1 and 2.

GSI1 partition key = idType_2 and GSI1 sort key = id to take care of Query 3.

GSI2 partition key = data-type and GSI2 sort key = tis to take care of Query 4.
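For reference, a minimal sketch of this key approach as a `CreateTable` request payload. The table name and projection choices are assumptions, not part of the question; the dict is only the request body you would pass to a boto3 DynamoDB client's `create_table(**payload)`, so no AWS call is made here.

```python
# Hypothetical CreateTable payload for the proposed key design.
# Table name and projections are assumptions for illustration.
payload = {
    "TableName": "events",  # hypothetical name
    "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],  # Queries 1 & 2
    "AttributeDefinitions": [
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "idType_2", "AttributeType": "N"},
        {"AttributeName": "data-type", "AttributeType": "S"},
        {"AttributeName": "tis", "AttributeType": "N"},
    ],
    "GlobalSecondaryIndexes": [
        {   # Query 3: look up all rows for an idType_2
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "idType_2", "KeyType": "HASH"},
                {"AttributeName": "id", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "INCLUDE",
                           "NonKeyAttributes": ["tis", "b.id"]},
        },
        {   # Query 4: range condition on tis within a data-type
            "IndexName": "GSI2",
            "KeySchema": [
                {"AttributeName": "data-type", "KeyType": "HASH"},
                {"AttributeName": "tis", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
        },
    ],
    "BillingMode": "PAY_PER_REQUEST",
}
```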

user2967920
    Just by looking at the data structure and query types that you've provided, your approach looks solid to me. – namuny Feb 05 '19 at 10:38
  • Why have you put two sets of different data in one table? – F_SO_K Feb 05 '19 at 10:42
  • @Stu Following the NoSQL pattern where a single table is used to hold the data; also, Query 4's output needs b.id data with a sort key on timestamps. b-type produces data that's referenced by a-type. – user2967920 Feb 05 '19 at 11:26

2 Answers


Here are my thoughts:

1) If you have data with different access patterns, you should consider splitting it into different tables.

2) If data is accessed together, store it together. What this means is that if, whenever you read a-type data for some modeled entity, you also need to read one or more b-type records for the same entity, it is advantageous to place all those records in the same table, under the same partition key.

3) Data that is not accessed together does not benefit at all from being placed in the same table, and in fact has the potential to become an issue in more extreme circumstances.

To bring this all home: in your example, the id for type-a and type-b data is different. This means you get zero benefit from storing both type a and type b in the same table. Use two different tables.

The main difference between relational and non-relational databases is that non-relational stores have no cross-table joins. So whereas one of the tenets of relational databases is data normalization, the opposite tends to be the case for non-relational stores.
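The "store it together" rule in point 2 can be sketched as an item collection: records that are read together share one partition key and differ only in sort key, so a single Query call returns all of them. The entity id, key names, and sort-key prefixes below are hypothetical, not from the question's schema.

```python
# Sketch of "if data is accessed together, store it together":
# both records live under one partition key ("pk"); sort-key prefixes
# ("A#", "B#") distinguish the record types. All values hypothetical.
items = [
    {"pk": "entity#1234567890", "sk": "A#abc-def-gi", "tis": 12345},
    {"pk": "entity#1234567890", "sk": "B#meta", "var_2": 40030230},
]

def query_partition(all_items, pk):
    """Simulate Query(KeyConditionExpression=Key('pk').eq(pk)):
    one call fetches every record of the entity."""
    return [item for item in all_items if item["pk"] == pk]

together = query_partition(items, "entity#1234567890")
```

One Query now retrieves both the a-type and b-type records; with separate partition keys (as in the question, where the ids differ) you would need two reads, which is why the answer recommends two tables instead.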

Mike Dinescu
  • Used this https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/tabledesign.png The reason type-a and type-b are kept in the same table was to avoid making 2 tables to hold data that the same backup engines use. Keeping it this way, the backup engine has to do one query to get all ids for the day to run the backup queries, whose data resides in a separate table. type-b data is static, type-a data is dynamic. The reason I didn't put type-b's data inside type-a itself was that there would be replication of the static data. So according to you, type-b belongs in another table? – user2967920 Feb 06 '19 at 04:26
  • If the data is accessed together it should be in the same table. I'm sorry, it's a bit abstract to talk about type a's and type b's.. also, the thing is that these rules are not absolutes: there is a lot of room for tradeoffs – Mike Dinescu Feb 06 '19 at 06:06

This was solved by doing the following inside DynamoDB, without creating any GSI.

When a GSI is created, whatever data is written to the main table is also copied into the GSI, so write cost is multiplied by the number of GSIs. With 1 GSI that's PrimaryWrite + GSIWrite; with 2 GSIs it's Primary + GSI1 + GSI2. The write rate into each GSI matches the primary, so if you're writing to the primary at 1000 WCU, the same applies to each GSI: a total of 2000 WCU for 1 GSI and 3000 WCU for 2 GSIs.
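The write-amplification arithmetic above reduces to one line (this is the answer's simplifying assumption that every item write is replicated to every GSI at the same size; actual GSI consumption also depends on projected attributes):

```python
def total_wcu(primary_wcu, num_gsis):
    """Write capacity consumed across the table plus its GSIs,
    assuming each write replicates to every GSI at the same cost."""
    return primary_wcu * (1 + num_gsis)

# 1000 WCU on the primary with 1 GSI -> 2000; with 2 GSIs -> 3000.
```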

What we did

application_unique_id as hash key
timestamp as sort key

The rest of the fields were stored as plain attributes (DynamoDB accepts arbitrary document attributes provided there is a valid hash key and a sort key).
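The item shape looks roughly like this sketch: only the hash and sort keys are fixed, everything else rides along as free-form attributes. Field values here are illustrative, borrowed from the question's table.

```python
# Sketch of the item layout: only application_unique_id (hash key)
# and timestamp (sort key) are schema-bound; the remaining fields
# are plain attributes, not index keys. Values are illustrative.
item = {
    "application_unique_id": "abc-def-gi",  # hash key
    "timestamp": 12345,                     # sort key
    "data-type": "a-type",
    "b.id": "1234567890",
    "idType_2": 843023,
}
# table.put_item(Item=item)  # the corresponding boto3 Table call
```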

We used a Lambda function attached to the table's DynamoDB Stream to write data into an ElasticSearch cluster.
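A hedged sketch of that stream-to-ElasticSearch Lambda, assuming the standard DynamoDB Streams record format. The tiny deserializer handles only the S/N/NULL attribute types used in this question, and the handler returns the documents instead of posting them to a real cluster.

```python
def deserialize(av):
    """Minimal DynamoDB AttributeValue decoder (S/N/NULL only)."""
    if "S" in av:
        return av["S"]
    if "N" in av:
        # DynamoDB transmits numbers as strings
        return float(av["N"]) if "." in av["N"] else int(av["N"])
    if "NULL" in av:
        return None
    raise ValueError(f"unhandled attribute type: {av}")

def stream_record_to_doc(record):
    """Turn one INSERT/MODIFY stream record into a flat ES document."""
    image = record["dynamodb"]["NewImage"]
    return {k: deserialize(v) for k, v in image.items()}

def handler(event, context=None):
    """Lambda entry point. A real handler would bulk-post the docs to
    the ElasticSearch cluster; here we simply return them."""
    return [stream_record_to_doc(r) for r in event["Records"]
            if r["eventName"] in ("INSERT", "MODIFY")]
```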

We made a daily index of the latest snapshot data, since DynamoDB holds all the trace points and is the best place to keep and query those.

This way we knew which data was sent on which day (DynamoDB doesn't let you export a list of hash keys), and we could do all the rest of the projection and comparison queries inside ElasticSearch.

DynamoDB solved querying the time-series data at sub-millisecond latency; ElasticSearch solved all the comparison and filter operations on top of that data.

We set the DynamoDB TTL to 30 days. ElasticSearch doesn't support TTL, but we drop each daily index once its creation day is more than 30 days back.
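The expiry bookkeeping on both sides is simple date arithmetic; a sketch, with a hypothetical `snapshots-` index-name prefix:

```python
from datetime import datetime, timedelta

TTL_DAYS = 30

def ttl_epoch(now):
    """Value for the item's DynamoDB TTL attribute: a Unix epoch
    30 days in the future (DynamoDB deletes the item after that)."""
    return int((now + timedelta(days=TTL_DAYS)).timestamp())

def index_name(day):
    """Daily ES index name, e.g. snapshots-2019-02-06 (prefix hypothetical)."""
    return "snapshots-" + day.strftime("%Y-%m-%d")

def should_drop(index, today):
    """True once the index creation day is more than 30 days back."""
    created = datetime.strptime(index.removeprefix("snapshots-"), "%Y-%m-%d")
    return (today - created).days > TTL_DAYS
```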

user2967920