2

I have a DynamoDB table. I have an index on the "Id" field.

I also have other fields - status, timestamp, etc, but I don't have index on these other fields.

The status can be "online", "offline", "in-progress", etc.

I want to get the count of the records based on "status" and "Id" field.

The user will pass the Id field and the query needs to return the count based on the status field. e.g.

"online" : 20

"offline" : 30

"in-progress" : 40

The query works fine.

As I understand, the maximum size of the DynamoDb query output is 1 MB. This limit applies before any FilterExpression is applied to the results.

Since the number of records in the table are huge, ( around 100k), I need to execute the queries again and again by passing "Exclusive Start key" parameter.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Pagination

In fact, I need to run multiple queries (one for each status value) in the loop, for calculating the counts based on "status" field.

Is there any efficient way to retrieve theses counts?

I am thinking of extending the index to the status field also. So it will eliminate the need for applying filter expression.

Dattatray
  • 1,745
  • 1
  • 21
  • 49

2 Answers2

0

If the field isn't indexed, you need to do a table scan to get the full count. You can parallelize the scan to make it faster, or just index it.

tsnowlan
  • 3,472
  • 10
  • 15
  • How would an index help, if all the rows are being counted? In a SQL db, sure, if there's a covering index, but I'm not seeing how to optimize this query pattern with an index in DynamoDB. I may be overlooking something. – Michael - sqlbot Sep 12 '19 at 00:49
  • DynamoDB is basically a giant key value store with some indexes as a concession to functionality we're used to with SQL. It's a tradeoff for the high I/O that you get. You put a Global Secondary Index on `status`, and then you'll be able to query that field's contents directly. Without that index, the data is completely free form and the database can't tell you what's in a particular record. So to find out you do a big table scan and do the lookup yourself. – tsnowlan Sep 12 '19 at 07:56
  • The docs on GSIs: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html – tsnowlan Sep 12 '19 at 07:59
  • Yep, that makes sense. I thought you were saying more than you were saying. My mistake. – Michael - sqlbot Sep 12 '19 at 12:12
0

There are fields ScannedCount and Count yet even if field is indexed, you will get count of items only when result of query is less than 1MB.

If you have a lot of rows or single row is big, max size per row may be up to 400KB, so if you have rows of 400KB, you may scan only couple of such before hitting 1MB limit and you will get count of those. If you have small rows, you will be able to scan through more during single query. Yet in any case DynamoDB will not scan all the data to give you results on one go. You will get paginated results.

With proper index your query won't need use filters, w/o good index you will do index-scan or table-scan probably with applied filters but it does nothing to work around the fact - query will always scan up to 1MB of data and will return paginated results.

From the docs:

ScannedCount — The number of items that matched the key condition expression before a filter expression (if present) was applied.

Count — The number of items that remain after a filter expression (if present) was applied.

If the size of the Query result set is larger than 1 MB, ScannedCount and Count represent only a partial count of the total items. You need to perform multiple Query operations to retrieve all the results.

Each Query response contains the ScannedCount and Count for the items that were processed by that particular Query request. To obtain grand totals for all of the Query requests, you could keep a running tally of both ScannedCount and Count.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html

Lukas Liesis
  • 24,652
  • 10
  • 111
  • 109