Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or does it open up other query patterns that might not be immediately obvious?

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html

E.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create 1 overloaded GSI or 2 non-overloaded GSIs?

For an example of what I'm referring to see the attached image:

https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing

The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:

  1. Published documents by date
  2. Draft documents by date
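To make the single-GSI option concrete, here is a minimal in-memory sketch of how one index could serve both queries by packing state and date into one sort key. The attribute names GSI1PK and GSI1SK, the "STATE#date" key format, and the helper functions are assumptions for illustration, not names from the linked image or the DynamoDB API; the prefix filter stands in for a begins_with key condition.

```python
# Hypothetical attribute names (GSI1PK, GSI1SK); the question does not name them.

def to_index_item(doc):
    """Add the index key attributes to a document record."""
    return {
        **doc,
        "GSI1PK": doc["owner"],
        # State and date share one sort key, e.g. "PUBLISHED#2019-03-01".
        "GSI1SK": f"{doc['state'].upper()}#{doc['date']}",
    }


def query_by_state(items, owner, state):
    """In-memory stand-in for a GSI Query: PK equality plus a prefix match on SK."""
    prefix = f"{state.upper()}#"
    matches = [
        item for item in items
        if item["GSI1PK"] == owner and item["GSI1SK"].startswith(prefix)
    ]
    # A real index returns items ordered by sort key; sorting mimics that.
    return sorted(matches, key=lambda item: item["GSI1SK"])


docs = [
    {"id": "d1", "owner": "alice", "state": "published", "date": "2019-03-01"},
    {"id": "d2", "owner": "alice", "state": "draft", "date": "2019-03-05"},
    {"id": "d3", "owner": "alice", "state": "published", "date": "2019-01-15"},
    {"id": "d4", "owner": "bob", "state": "draft", "date": "2019-02-02"},
]
index = [to_index_item(d) for d in docs]

published = [item["id"] for item in query_by_state(index, "alice", "published")]
drafts = [item["id"] for item in query_by_state(index, "alice", "draft")]
print(published, drafts)
```

With two separate GSIs, each index would instead use the date alone as its sort key and contain only the items in one state.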

I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html

I'm trying to understand why someone would go down this route as it seems incredibly complicated.

Martin Bayly
  • I just wrote a long Q&A which might help a bit https://stackoverflow.com/questions/55152296/how-to-model-one-to-one-one-to-many-and-many-to-many-relationships-in-dynamodb – F_SO_K Mar 13 '19 at 22:49
  • In short, don't do it! – F_SO_K Mar 13 '19 at 22:50

1 Answer

The idea, in a nutshell, is to avoid the overhead of doing joins in the database layer, or of going back to the database repeatedly to do the join in the application layer. With the data already sliced in the shape your application requires, all you need is a single call along the lines of select * from table where x = y, which returns multiple entities at once (in your example that could be Users and Documents). This makes it extremely efficient and scalable at the db level. It also means that you are less flexible, because you need to know the access patterns in advance and model your data accordingly.
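As a rough illustration of "multiple entities in one call", the sketch below simulates in memory a single-table layout where a user and that user's documents share a partition key. The PK/SK attribute names, the "USER#" and "DOC#" prefixes, and the sample data are assumptions made up for this example; query_partition stands in for a Query on the partition key alone.

```python
from collections import defaultdict

# Hypothetical single-table layout: key names and prefixes are assumptions.
table = [
    {"PK": "USER#alice", "SK": "PROFILE", "type": "User", "name": "Alice"},
    {"PK": "USER#alice", "SK": "DOC#2019-01-15#d3", "type": "Document", "title": "Notes"},
    {"PK": "USER#alice", "SK": "DOC#2019-03-01#d1", "type": "Document", "title": "Report"},
    {"PK": "USER#bob", "SK": "PROFILE", "type": "User", "name": "Bob"},
]


def query_partition(items, pk):
    """In-memory stand-in for a Query on the partition key, ordered by sort key."""
    return sorted((i for i in items if i["PK"] == pk), key=lambda i: i["SK"])


def group_by_type(items):
    """Split one result set into its entity types on the application side."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["type"]].append(item)
    return dict(grouped)


# One "call" returns the user record and that user's documents together.
result = group_by_type(query_partition(table, "USER#alice"))
print(sorted(result))
```

The trade-off mentioned above shows up here too: the layout only works because the "user plus documents" access pattern was known when the keys were designed.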

See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.

I don't think it has any performance benefits, at least none that I've seen called out, which makes sense since it's the same query and storage engine.

That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.

dandoen
  • Thanks, yes, my question was in response to watching that video. But my question is not so much about why you would want to use a GSI. More about why would you use a single table with GSI overloading as opposed to multiple tables/GSIs. The GSI limit per table is now 20. So I'm trying to understand if there are other reasons related to performance why you might prefer fewer tables/GSIs. E.g. if you have GSIs that use the same partition keys but in different ways. I believe each GSI is treated independently with respect to partitioning. – Martin Bayly Mar 13 '19 at 16:47
  • I don't think there is anything perf related. At least, none that I've heard of or seen called out. I updated my answer with some practical reasons. – dandoen Nov 27 '19 at 16:07