5

I am using the following code to query and paginate through a DynamoDB query:

    class DecimalEncoder(json.JSONEncoder):
        def default(self, o):
            if isinstance(o, decimal.Decimal):
                return str(o)
            return super(DecimalEncoder, self).default(o)
    
    
    def run(date: int, start_epoch: int, end_epoch: int):
        dynamodb = boto3.resource('dynamodb',
                                  region_name='REGION',
                                  config=Config(proxies={'https': 'PROXYIP'}))
    
        table = dynamodb.Table('XYZ')
    
        response = table.query(
            # ProjectionExpression="#yr, title, info.genres, info.actors[0]", #THIS IS A SELECT STATEMENT
            # ExpressionAttributeNames={"#yr": "year"},  #SELECT STATEMENT RENAME
            KeyConditionExpression=Key('date').eq(date) & Key('uid').between(start_epoch, end_epoch)
        )
    
        for i in response[u'Items']:
            print(json.dumps(i, cls=DecimalEncoder))
    
        while 'LastEvaluatedKey' in response:
            response = table.scan( ##IS THIS INEFFICIENT CODE?
                # ProjectionExpression=pe,
                # FilterExpression=fe,
                # ExpressionAttributeNames=ean,
                ExclusiveStartKey=response['LastEvaluatedKey']
            )
    
            for i in response['Items']:
                print(json.dumps(i, cls=DecimalEncoder))

Although this code works, it is incredibly slow and I fear that 'response = table.scan' is the result of this. I am under the impression that queries are much faster than scan's (as scans require an entire iteration of the table). Is this code causing a complete iteration of the database table?

This might be a separate question, but is there a more efficient way (with code examples) of doing this? I've attempted using Boto3's pagination but I could not get that working with queries either.

CpILL
  • 6,169
  • 5
  • 38
  • 37
dandev91
  • 1,691
  • 3
  • 22
  • 34

2 Answers2

7

The answer provided by Nadav Har'El was key to resolving this. I was incorrectly using DynamoDB pagination code examples by doing an initial DynamoDB query, but then using scan to paginate!

The correct way was to use query initially AND for pagination:

    class DecimalEncoder(json.JSONEncoder):
            def default(self, o):
                if isinstance(o, decimal.Decimal):
                    return str(o)
                return super(DecimalEncoder, self).default(o)
        
        
        def run(date: int, start_epoch: int, end_epoch: int):
            dynamodb = boto3.resource('dynamodb',
                                      region_name='REGION',
                                      config=Config(proxies={'https': 'PROXYIP'}))
        
            table = dynamodb.Table('XYZ')
        
            response = table.query(
                KeyConditionExpression=Key('date').eq(date) & Key('uid').between(start_epoch, end_epoch)
            )
        
            for i in response[u'Items']:
                print(json.dumps(i, cls=DecimalEncoder))
        
            while 'LastEvaluatedKey' in response:
                response = table.query(
                    KeyConditionExpression=Key('date').eq(date) & Key('uid').between(start_epoch, end_epoch),
                    ExclusiveStartKey=response['LastEvaluatedKey']
                )
        
                for i in response['Items']:
                    print(json.dumps(i, cls=DecimalEncoder))

I have still marked Nadav Har'El's response as correct as it was his answer that lead to this code example.

CpILL
  • 6,169
  • 5
  • 38
  • 37
dandev91
  • 1,691
  • 3
  • 22
  • 34
3

Unfortunately, yes, a "Scan" operation reads the entire table. You didn't say what is your table's partition key, but if it is a date, then what you are really doing here is to read a single partition, and this indeed, what a "Query" operation does much more efficiently, because it can jump directly to the required partition instead of scanning the entire table looking for it.

Even with Query, you still need to do paging exactly like you did, because there's a possibility that the partition still have a lot of items. But at least you won't be scanning the entire table.

By the way, scanning the entire table will cost you a lot of read operations. You can ask AWS how many reads were accounted for you, and this can help you catch cases where you're reading too much - beyond the obvious slowness which you noticed.

Nadav Har'El
  • 11,785
  • 1
  • 24
  • 45
  • I want to make this query as fast as possible and I know my dataset will be larger than DynamoDB's 1MB API limit, so I will have to use pagination. I wanted to confirm that my pagination code is the most efficient way to do this - because its really slow. My code above uses an initial query that uses a date partition, but then uses a scan query for the pagination... Is the scan query for the pagination the only way to do this? – dandev91 Apr 04 '19 at 23:45
  • No, as I said, a "Query" is better that a "Scan". A "Scan" always reads the entire database, while a "Query" reads just one partition (which also can be long so you also need to paginate through it). Please consult the documentation on the difference between a "scan" and a "query" operation. – Nadav Har'El Apr 04 '19 at 23:49
  • 1
    I've added my code answer below. My issue was that I didn't realise you could paginate using table.query because the code examples I found used table.scan. This should resolve my issue. I've added my code as an answer below. – dandev91 Apr 05 '19 at 01:32