61

Hi, I have a DynamoDB table. I want the service to return all the items in this table, sorted by one attribute.

Do I need to create a global secondary index (GSI) for this? If so, what should the hash key be, and what should the range key be? (Note that a query on a GSI must specify an "EQ" comparator on the GSI's hash key.)

Thanks a lot!

Erben

Trenton
Erben Mo

3 Answers

33

If you know the HashKey, then any query will return the items sorted by Range key. From the documentation:

Query results are always sorted by the range key. If the data type of the range key is Number, the results are returned in numeric order. Otherwise, the results are returned in order of UTF-8 bytes. By default, the sort order is ascending. To reverse the order, set the ScanIndexForward parameter to false.

Now, if you need to return all the items, you should use a scan. You cannot order the results of a scan.
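If a scan is the only option, the sorting has to happen on the client. Here is a minimal sketch of that, assuming the table is small enough to hold in memory. `scanPage` is a hypothetical stand-in for whatever Scan call your SDK provides; the only DynamoDB behaviour relied on is that each page may carry a `LastEvaluatedKey`, which is passed back to fetch the next page. The fake pages and the `score` attribute are made up for illustration.

```javascript
// Collect every page of a Scan, then sort the items in memory.
// `scanPage` (hypothetical) receives the previous page's LastEvaluatedKey,
// or undefined for the first page, and resolves to { Items, LastEvaluatedKey }.
async function scanAllSorted(scanPage, sortAttr, descending = false) {
  const items = [];
  let startKey;
  do {
    const page = await scanPage(startKey);
    items.push(...(page.Items || []));
    startKey = page.LastEvaluatedKey; // undefined once the last page is read
  } while (startKey !== undefined);

  // Scan returns items in no useful order, so sort on the client.
  items.sort((a, b) => {
    if (a[sortAttr] < b[sortAttr]) return -1;
    if (a[sortAttr] > b[sortAttr]) return 1;
    return 0;
  });
  return descending ? items.reverse() : items;
}

// In-memory fake that serves two pages, for illustration only.
const fakePages = [
  { Items: [{ id: "a", score: 30 }, { id: "b", score: 10 }], LastEvaluatedKey: { id: "b" } },
  { Items: [{ id: "c", score: 20 }] },
];
const fakeScanPage = async (startKey) =>
  startKey === undefined ? fakePages[0] : fakePages[1];

scanAllSorted(fakeScanPage, "score").then((items) => {
  console.log(items.map((item) => item.id).join(","));
});
```

This reads the whole table on every call, so it is probably only reasonable for small tables or infrequent jobs.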

Another option is to use a GSI (example). Here, you see that the GSI contains only HashKey. The results I guess will be in sorted order of this key (I didn't check this part in a program yet!).

Ninjakannon
Sony Kadavan
  • 14
    The documentation describes ScanIndexForward as only applying to Query operations, not Scan. I don't see documentation saying this applies to Scan. – pauldoo Oct 08 '14 at 14:06
  • 1
    Results by querying over GSI will be ordered by range key of the GSI. I quote from the [GSI documentation](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html) _The results are ordered by the range key, TopScore. If you set the ScanIndexForward parameter to false, the results are returned in descending order, so the highest score is returned first._ – manyu Jul 14 '15 at 09:35
  • 1
    there's nothing special about GSI's. A hash disperse keys, always, there's no such thing as a sorted hash table. – André Werlang May 04 '16 at 02:27
  • 1
    @Sony Kadavan I have created a table having 'partition key + sort key' as the 'primary key'. Now, I want to retrieve all the items with the help of partition key. Can I use 'scan' in this case or is there any other alternative? – unknownerror Jul 04 '16 at 07:07
  • @vardin have you found any solution – Rishabh Agrawal May 10 '17 at 08:17
  • @ANinJa, just do the query using partition key. You will be returned with list of items which are matching that partition key. Once you got those fields iterate over those in order to get the required sort key value. Since the item size was very small, above step was scalable for me. If the item size is huge, i would suggest you to use local secondary indexes and query using those indexes. – unknownerror May 11 '17 at 18:37
  • if you know mongoose is there any way like schema.find({}) ? – Rishabh Agrawal May 11 '17 at 18:41
12

As of now, a DynamoDB Scan cannot return sorted results.

You need to use a query with a new global secondary index (GSI) with a hashkey and range field. The trick is to use a hashkey which is assigned the same value for all data in your table.

I recommend making a new field for all data and calling it "Status" and set the value to "OK", or something similar.

Then your query to get all the results sorted would look like this:

{
    TableName: "YourTable",
    IndexName: "Status-YourRange-index",
    KeyConditions: {
        Status: {
            ComparisonOperator: "EQ", 
            AttributeValueList: [ 
                "OK"
            ]
        }
    },
    ScanIndexForward: false
}

The docs for how to write GSI queries are found here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Querying
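The request above uses the older `KeyConditions` form. As a sketch only, the same query can be written with a key condition expression. The helper name `buildSortedQueryParams` is made up for illustration, and the values are plain JavaScript values in the style of the SDK's document client rather than typed attribute values. The `#s` placeholder is there because, as far as I know, `Status` is on DynamoDB's reserved-word list; using a placeholder is harmless either way.

```javascript
// Build Query parameters for the "constant hash key" GSI described above.
// Table and index names are the placeholders from the answer.
function buildSortedQueryParams(tableName, indexName, limit) {
  const queryParams = {
    TableName: tableName,
    IndexName: indexName,
    KeyConditionExpression: "#s = :status",
    ExpressionAttributeNames: { "#s": "Status" }, // placeholder for the attribute name
    ExpressionAttributeValues: { ":status": "OK" }, // the value shared by every item
    ScanIndexForward: false, // descending by the index's range key
  };
  if (limit !== undefined) {
    queryParams.Limit = limit; // e.g. only the first 10 items in sorted order
  }
  return queryParams;
}

const exampleParams = buildSortedQueryParams("YourTable", "Status-YourRange-index", 10);
console.log(JSON.stringify(exampleParams, null, 2));
```

Passing a `Limit` keeps the read cheap when only the top few items are needed.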

Deemoe
  • 59
    This is terrible advice. The performance characteristics of a DynamoDB table apply the same to GSIs. A GSI with a single hash key of "OK" will only ever use one partition. This loses all scaling characteristics of DynamoDB. – prestomation Aug 25 '15 at 19:23
  • 4
    That is true, results will come from a single partition; however, if used in conjunction with a limit, it can be efficient. If you limit to, say, 10 items, you'll get the first 10 items in sorted order and it will not be an expensive query (other than having to maintain the second GSI). So it is efficient if you just want to display the most recent items on the list, for example. – Deemoe Aug 25 '15 at 21:38
  • 2
    The author of the question didn't say how many items are in his table or how many read units he wanted to provision but it looks like even a single partition can support 1000 read units fairly comfortably, up to 3000 read units max. Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html – Deemoe Aug 25 '15 at 21:58
  • 1
    A single key (table or GSI) can only receive 1000 writes per second maximum. If the table itself is well distributed, it can take an 'unlimited' number of writes. Whenever a write is performed against the table, a write is queued up against the GSI. A GSI of this schema will have a write throughput maximum of 1000. If you are writing to your table at more than 1000 writes per second on an ongoing basis, the GSI queue will fall behind and never catch up. This means you will query the GSI and entries will simply be missing, as they haven't been written to the index yet, even if they have been written to the table. – prestomation Aug 26 '15 at 22:01
  • 37
    @prestomation this is not terrible advice. True, it doesn't scale well. But on the contrary, within the constraints (1000 writes/s, 3000 reads/s, 10 GB table size) it can sustain a very cost-effective solution. Terrible advice is blindly following best practices and turning away from real-world solutions. – André Werlang May 04 '16 at 02:33
  • 2
    This is functionally equivalent to "index write sharding" with only 1 shard. See https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html – Trenton Apr 16 '18 at 21:39
  • Agreed with @AndréWerlang . This should be the accepted answer as it's the only answer that actually does what the OP was hoping for. Yes, it *does* have limitations and must be used carefully. – kdavh May 05 '20 at 19:21
  • 4
    One thing to note is that AWS Best Practices for Handling Time Series Data, while it recommends using multiple tables to allow for more efficient provisioned read/write capacity, the recommended solution uses the same partition key for every item in each table: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-time-series.html – khaullen Jul 21 '20 at 04:25
  • what happens if i have duplicates? example: {status:ok, score:12} {status:ok, score:12} – Ushan Fernando Jan 05 '23 at 12:09
12

The approach I followed to solve this problem was to create a Global Secondary Index as below. I'm not sure if this is the best approach, but I'm posting it in case it is useful to someone.

Hash Key                 | Range Key
------------------------------------
Date value of CreatedAt  | CreatedAt

A limitation is imposed on the HTTP API user: they must specify the number of days of data to retrieve, which defaults to 24 hours.

This way, I can always specify the hash key as the current date's day, and the range key can use the > and < operators when retrieving. The data is also spread across multiple shards.
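As a rough sketch of how the retrieval side of this scheme might look: one Query per day bucket, newest bucket first, each bounded by the start of the requested window. The attribute names `CreatedDate` and `CreatedAt`, the index name, and the helper names are all hypothetical, and the sketch assumes `CreatedAt` is stored as a UTC ISO 8601 string so that string comparison matches time order.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// UTC calendar day of a Date, e.g. "2018-04-16"; used as the GSI hash key.
function dayKey(date) {
  return date.toISOString().slice(0, 10);
}

// Build one Query request per day bucket touched by the last `days` days.
// A window of N days can touch N + 1 calendar days, hence the `<=`.
function buildWindowQueries(tableName, now, days = 1) {
  const since = new Date(now.getTime() - days * DAY_MS).toISOString();
  const queries = [];
  for (let i = 0; i <= days; i++) {
    const bucket = dayKey(new Date(now.getTime() - i * DAY_MS));
    queries.push({
      TableName: tableName,
      IndexName: "CreatedDate-CreatedAt-index", // hypothetical index name
      KeyConditionExpression: "CreatedDate = :day AND CreatedAt >= :since",
      ExpressionAttributeValues: { ":day": bucket, ":since": since },
      ScanIndexForward: false, // newest first within each day
    });
  }
  return queries;
}

const windowQueries = buildWindowQueries("YourTable", new Date("2018-04-16T21:39:00Z"));
console.log(windowQueries.map((q) => q.ExpressionAttributeValues[":day"]));
```

Because the buckets are disjoint days and each one is read in descending order, concatenating the results in bucket order should give a single newest-first list.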

Gireesh
  • 2
    This is the pattern recommended in the [best practice for handling time series data](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-time-series.html) which itself is a type of [index write sharding](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html). – Trenton Apr 16 '18 at 21:41