103

My table is around 220 MB with 250k records in it. I'm trying to pull all of this data into Python. I realize this needs to be a chunked batch process that loops through the table, but I'm not sure how to make each batch start where the previous one left off.

Is there some way to filter my scan? From what I've read, filtering occurs after the data is loaded, and loading stops at 1 MB, so a filter alone wouldn't let me scan in new objects.

Any assistance would be appreciated.

import boto3
dynamodb = boto3.resource('dynamodb',
    aws_session_token = aws_session_token,
    aws_access_key_id = aws_access_key_id,
    aws_secret_access_key = aws_secret_access_key,
    region_name = region
    )

table = dynamodb.Table('widgetsTableName')

data = table.scan()
Madura Pradeep
CJ_Spaz

10 Answers

134

I think the Amazon DynamoDB documentation regarding table scanning answers your question.

In short, you'll need to check for LastEvaluatedKey in the response. Here is an example using your code:

import boto3
dynamodb = boto3.resource('dynamodb',
                          aws_session_token=aws_session_token,
                          aws_access_key_id=aws_access_key_id,
                          aws_secret_access_key=aws_secret_access_key,
                          region_name=region
)

table = dynamodb.Table('widgetsTableName')

response = table.scan()
data = response['Items']

while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
Taeber
  • 42
    While this may work, note that the [boto3 documentation](http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#DynamoDB.Client.query) states _If LastEvaluatedKey is empty, then the "last page" of results has been processed and there is no more data to be retrieved._ So the test I'm using is `while response.get('LastEvaluatedKey')` rather than `while 'LastEvaluatedKey' in response`, just because "is empty" doesn't necessarily mean "isn't present," and this works in either case. – kungphu Aug 23 '16 at 02:44
  • 1
    paginator is more convenient way to iterate through queried/scanned items – iuriisusuk Feb 20 '18 at 12:33
  • @kungphu `response.get('LastEvaluatedKey')` I got `None`, it can't apply the condition to `while` loop. – John Jang Mar 04 '19 at 08:59
  • 2
    @John_J You could use `while True:` and then `if not response.get('LastEvaluatedKey'): break` or something similar. You could also put your processing in a function, call it, and then use the `while response.get(...):` above to call it again to process subsequent pages. You basically just need to emulate `do... while`, which does not explicitly exist in Python. – kungphu Mar 04 '19 at 11:08
  • 6
    Why not use: `while response.get('LastEvaluatedKey', False)`? – Hephaestus May 31 '19 at 03:46
  • @Hephaestus that would work as well, but it's not necessary: `.get` returns `None` by default if the requested key is not there, and `None` evaluates to `False`. This can be confirmed by running `bool({}.get('test'))` – Jon H Aug 27 '20 at 23:04
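
The `do ... while` pattern discussed in these comments can be sketched as follows. This is a minimal sketch rather than code from any answer: `scan_all` and `FakeTable` are hypothetical names, and `FakeTable` only imitates the shape of a `Table.scan` response so the loop can be run without AWS access.

```python
def scan_all(table, **scan_kwargs):
    """Collect every item from a paginated scan using a do-while style loop."""
    items = []
    kwargs = dict(scan_kwargs)  # copy so the caller's arguments are not mutated
    while True:
        response = table.scan(**kwargs)
        items.extend(response.get('Items', []))
        last_key = response.get('LastEvaluatedKey')
        if not last_key:  # missing, None, or empty all mean "last page"
            break
        kwargs['ExclusiveStartKey'] = last_key
    return items


class FakeTable:
    """Hypothetical stand-in for a boto3 Table that serves fixed pages."""

    def __init__(self, pages):
        self.pages = pages

    def scan(self, **kwargs):
        index = kwargs.get('ExclusiveStartKey', {'page': 0})['page']
        response = {'Items': self.pages[index]}
        if index + 1 < len(self.pages):
            response['LastEvaluatedKey'] = {'page': index + 1}
        return response


# The empty middle page mimics a filtered scan where no item on that page matched.
pages = [[{'id': 1}, {'id': 2}], [], [{'id': 3}]]
all_items = scan_all(FakeTable(pages))
print(len(all_items))
```

The empty page in the middle is deliberate: with a filter, a page can come back with no items and still carry a `LastEvaluatedKey`, so the loop should stop on the missing key and not on an empty `Items` list.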
42

DynamoDB limits each scan request to 1 MB of data.

Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan

Here is an example loop to get all the data from a DynamoDB table using LastEvaluatedKey:

import boto3
client = boto3.client('dynamodb')

def dump_table(table_name):
    results = []
    last_evaluated_key = None
    while True:
        if last_evaluated_key:
            response = client.scan(
                TableName=table_name,
                ExclusiveStartKey=last_evaluated_key
            )
        else: 
            response = client.scan(TableName=table_name)
        last_evaluated_key = response.get('LastEvaluatedKey')
        
        results.extend(response['Items'])
        
        if not last_evaluated_key:
            break
    return results

# Usage
data = dump_table('your-table-name')

# do something with data

Richard
39

boto3 offers paginators that handle all the pagination details for you. Here is the doc page for the scan paginator. Basically, you would use it like so:

import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

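# TableName is required; 'widgetsTableName' is the table from the question
for page in paginator.paginate(TableName='widgetsTableName'):
    for item in page['Items']:
        print(item)  # do something with each item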
Jordon Phillips
  • 14
    Note that the items in `page['Items']` may not be what you're expecting: Since this paginator is painfully generic, what you'll get back for each DynamoDB item is a dictionary of format type: value, e.g. `{'myAttribute': {'M': {}}, 'yourAttribute': {'N': u'132457'}}` for a row with an empty map and a numeric type (which is returned as a string that needs to be cast; I suggest `decimal.Decimal` for this since it already takes a string and will handle non-integer numbers). Other types, e.g. strings, maps, and booleans, are converted to their Python types by boto. – kungphu Jun 17 '16 at 04:34
  • 1
    Is it possible to have a scan filter or FilterExpression with pagination? – MuntingInsekto Jul 15 '16 at 02:36
  • 7
    paginators would be great, if it weren't for the issue @kungphu raised. I don't see the use for something that does one useful thing, but negates it by polluting the response data with irrelevant metadata – Bruce Edge Apr 17 '17 at 20:35
  • @kungphu @Bruce, curious if yall are aware of any recent improvements for this "polluted" dictionary approach ? I'm thinking of switching back to resource instead of client, and just using `LastEvaluatedKey` approach .. it just feels like too much to have to paginate and then have to parse out the response – D.Tate Mar 13 '20 at 17:06
  • 1
    Ah nevermind... I think I found my answer in the form of `TypeDeserializer` (https://stackoverflow.com/a/46738251/923817). Sweet! – D.Tate Mar 13 '20 at 17:28
  • 1
    @D.Tate Glad you found your solution. My work lately is all in Clojure, and the libraries are much less obtuse (though it only gets so good, working with Amazon's APIs). :) And thank you for linking that here, for others who might find this question later! – kungphu Mar 14 '20 at 18:16
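
To make the typed format from these comments concrete, here is a hand-rolled sketch of the unwrapping step. It is an illustration only: `unwrap` and `unwrap_item` are hypothetical helpers, not part of boto3, and the `TypeDeserializer` mentioned above is the library's own tool for this job. Numbers are converted with `decimal.Decimal`, as suggested in the first comment.

```python
from decimal import Decimal


def unwrap(attribute_value):
    """Convert one DynamoDB-typed value, e.g. {'N': '1'}, to a plain Python value."""
    # Each typed value is a dict with exactly one key: the type tag.
    (type_tag, value), = attribute_value.items()
    if type_tag == 'N':
        return Decimal(value)  # numbers arrive as strings
    if type_tag in ('S', 'BOOL', 'B'):
        return value
    if type_tag == 'NULL':
        return None
    if type_tag == 'M':
        return {key: unwrap(inner) for key, inner in value.items()}
    if type_tag == 'L':
        return [unwrap(inner) for inner in value]
    if type_tag == 'NS':
        return {Decimal(number) for number in value}
    if type_tag in ('SS', 'BS'):
        return set(value)
    raise ValueError(f'unknown DynamoDB type tag: {type_tag}')


def unwrap_item(item):
    """Unwrap every attribute of one item from page['Items']."""
    return {name: unwrap(typed) for name, typed in item.items()}


raw = {'myAttribute': {'M': {}}, 'yourAttribute': {'N': '132457'}}
plain = unwrap_item(raw)
print(plain)
```

Applying the conversion only to `page['Items']` leaves `LastEvaluatedKey` in its original typed form, which is what the next request expects.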
36

Riffing off of Jordon Phillips's answer, here's how you'd pass a FilterExpression in with the pagination:

import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_parameters = {
  'TableName': 'foo',
  'FilterExpression': 'bar > :x AND bar < :y',
  'ExpressionAttributeValues': {
    ':x': {'S': '2017-01-31T01:35'},
    ':y': {'S': '2017-01-31T02:08'},
  }
}

page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    for item in page['Items']:
        print(item)  # do something with each item
Abe Voelker
  • What would a `contains` filter look like? – DenCowboy Apr 02 '20 at 13:10
  • @DenCowboy I think the FilterExpression would just look like `'FilterExpression': 'contains(Color, :x)'`. See the CLI example here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.ExpressionAttributeValues.html – Abe Voelker Apr 02 '20 at 17:38
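
As a rough sketch of the `contains` filter suggested in the comment above, the parameters for the low-level client could be built like this. `contains_filter_params` is a hypothetical helper, and the table name `foo` and attribute name `Color` are placeholders borrowed from this thread. The `#attr` placeholder is an optional precaution in case the attribute name collides with a DynamoDB reserved word.

```python
def contains_filter_params(table_name, attribute, substring):
    """Build scan parameters that keep items whose string attribute contains a substring."""
    return {
        'TableName': table_name,
        'FilterExpression': 'contains(#attr, :x)',
        # Placeholder for the attribute name, in case it is a reserved word.
        'ExpressionAttributeNames': {'#attr': attribute},
        # The low-level client expects typed values; 'S' marks a string.
        'ExpressionAttributeValues': {':x': {'S': substring}},
    }


operation_parameters = contains_filter_params('foo', 'Color', 'Red')
print(operation_parameters)
# These would then be passed on as: paginator.paginate(**operation_parameters)
```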
8

Code for stripping the DynamoDB type descriptors that @kungphu mentioned:

import boto3

from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('query')
service_model = client._service_model.operation_model('Query')
trans = TransformationInjector(deserializer = TypeDeserializer())
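# Pass your Query parameters (TableName, KeyConditionExpression, ...) to paginate()
for page in paginator.paginate():
    # Converts the typed attribute values in the page to plain Python values, in place
    trans.inject_attribute_value_output(page, service_model)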
Vincent
  • 2
    Bravo! negates my earlier comment above about the lack of usefulness of paginators. thanks! Why is this not the default behavior? – Bruce Edge Apr 17 '17 at 22:00
  • I had issues with LastEvaluatedKey being transformed and that messed up the paginator. – Dan Hook Jul 10 '19 at 14:50
5

Turns out that Boto3 captures the "LastEvaluatedKey" as part of the returned response. This can be used as the start point for a scan:

data = table.scan(
    ExclusiveStartKey=data['LastEvaluatedKey']
)

I plan on building a loop around this that runs until the response no longer contains a LastEvaluatedKey.

CJ_Spaz
5

The two approaches suggested above both have drawbacks: you either write lengthy, repetitive code that handles paging explicitly in a loop, or you use Boto paginators with the low-level client and forgo the advantages of the higher-level Boto objects.

A solution using Python functional code to provide a high-level abstraction allows higher-level Boto methods to be used, while hiding the complexity of AWS paging:

import itertools
import typing

def iterate_result_pages(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Generator:
    """A wrapper for functions using AWS paging, that returns a generator which yields a sequence of items for
    every response

    Args:
        function_returning_response: A function (or callable), that returns an AWS response with 'Items' and optionally 'LastEvaluatedKey'
        This could be a bound method of an object.

    Returns:
        A generator which yields the 'Items' field of the result for every response
    """
    response = function_returning_response(*args, **kwargs)
    yield response["Items"]
    while "LastEvaluatedKey" in response:
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
        response = function_returning_response(*args, **kwargs)
        yield response["Items"]

    return

def iterate_paged_results(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Iterator:
    """A wrapper for functions using AWS paging, that returns an iterator of all the items in the responses.
    Items are yielded to the caller as soon as they are received.

    Args:
        function_returning_response: A function (or callable), that returns an AWS response with 'Items' and optionally 'LastEvaluatedKey'
        This could be a bound method of an object.

    Returns:
        An iterator which yields one response item at a time
    """
    return itertools.chain.from_iterable(iterate_result_pages(function_returning_response, *args, **kwargs))

# Example, assuming 'table' is a Boto DynamoDB table object:
all_items = list(iterate_paged_results(table.scan, ProjectionExpression='my_field'))
Isac Casapu
  • 1
    Small change to the example to show table arg, all_items = list(iterate_paged_results(table.scan, ProjectionExpression = 'my_field')) – bruce szalwinski Nov 26 '19 at 04:52
  • 1
    I actually like this solution the most. It combines the simplicity of item access with abstracting the pagination away. The only complaint I have is that it's a bit overengineered: the same could be done with a single function and without itertools, by just yielding each item from `response["Items"]` both times. – demosito Jun 02 '20 at 16:25
5

If you are landing here looking for a paginated scan with some filtering expression(s):

def scan(table, **kwargs):
    response = table.scan(**kwargs)
    yield from response['Items']
    while response.get('LastEvaluatedKey'):
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
        yield from response['Items']

Example usage:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.Session(...).resource('dynamodb').Table('widgetsTableName')

items = list(scan(table, FilterExpression=Attr('name').contains('foo')))
Pierre D
4

I had some problems with Vincent's answer related to the transformation being applied to the LastEvaluatedKey and messing up the pagination. Solved as follows:

import boto3

from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_model = client._service_model.operation_model('Scan')
trans = TransformationInjector(deserializer = TypeDeserializer())
operation_parameters = {
  'TableName': 'tablename',  
}
items = []

for page in paginator.paginate(**operation_parameters):
    has_last_key = 'LastEvaluatedKey' in page
    if has_last_key:
        last_key = page['LastEvaluatedKey'].copy()
    trans.inject_attribute_value_output(page, operation_model)
    if has_last_key:
        page['LastEvaluatedKey'] = last_key
    items.extend(page['Items'])
Dan Hook
3

I can't work out why Boto3 provides high-level resource abstraction but doesn't provide pagination. When it does provide pagination, it's hard to use!

The other answers to this question were good but I wanted a super simple way to wrap the boto3 methods and provide memory-efficient paging using generators:

import typing
import boto3
import boto3.dynamodb.conditions


def paginate_dynamodb_response(dynamodb_action: typing.Callable, **kwargs) -> typing.Generator[dict, None, None]:

    # Using the syntax from https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/dynamodb/GettingStarted/scenario_getting_started_movies.py
    keywords = kwargs

    done = False
    start_key = None

    while not done:
        if start_key:
            keywords['ExclusiveStartKey'] = start_key

        response = dynamodb_action(**keywords)

        start_key = response.get('LastEvaluatedKey', None)
        done = start_key is None

        for item in response.get("Items", []):
            yield item


## Usage ##
dynamodb_res = boto3.resource('dynamodb')
dynamodb_table = dynamodb_res.Table('my-table')

query = paginate_dynamodb_response(
    dynamodb_table.query, # The boto3 method. E.g. query or scan
    # Regular Query or Scan parameters
    #
    # IndexName='myindex' # If required
    KeyConditionExpression=boto3.dynamodb.conditions.Key('id').eq('1234')
)

for x in query:
    print(x)
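
To illustrate why a generator keeps this memory-efficient, the sketch below counts how many pages are actually requested when only the first few items are consumed. It is a simplified stand-in under stated assumptions: `paginate_items` is a hypothetical, condensed version of the same idea, and `fake_scan` imitates a boto3 `scan` call with three fixed pages so that no AWS access is needed.

```python
import itertools


def paginate_items(action, **kwargs):
    """Yield items one at a time, requesting the next page only when it is needed."""
    while True:
        response = action(**kwargs)
        yield from response.get('Items', [])
        start_key = response.get('LastEvaluatedKey')
        if not start_key:
            return
        kwargs['ExclusiveStartKey'] = start_key


calls = []  # records which pages were requested


def fake_scan(**kwargs):
    """Hypothetical stand-in for table.scan: three pages of two items each."""
    page = kwargs.get('ExclusiveStartKey', {'page': 0})['page']
    calls.append(page)
    response = {'Items': [{'id': page * 2}, {'id': page * 2 + 1}]}
    if page < 2:
        response['LastEvaluatedKey'] = {'page': page + 1}
    return response


# Take only the first three items; the third page is never requested.
first_three = list(itertools.islice(paginate_items(fake_scan), 3))
print(first_three, calls)
```

Because the caller stops after three items, only two of the three pages are fetched, which is the saving a list-building loop cannot offer.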

Alastair McCormack