
Below is a commonly shared function that iterates over all objects in a bucket. What if I only want to iterate over the objects under a specific key prefix? For example, say the S3 URI is s3://test-data-lake/test1/test2/

and there are five JSON files under test2, e.g. s3://test-data-lake/test1/test2/test1.json.

How can I change this code to handle that?

import boto3


def iterate_bucket_items(bucket):
    """
    Generator that iterates over all objects in a given s3 bucket

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2 
    for return data format
    :param bucket: name of s3 bucket
    :return: dict of metadata for an object
    """


    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        if page['KeyCount'] > 0:
            for item in page['Contents']:
                yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print(i)
Marcin
0004
  • To avoid the need for pagination, you can use the Bucket `Resource` interface rather than the `Client` interface. For example: `objects = s3.Bucket('mybucket').objects.filter(Prefix='test1/test2/')` – jarmod Oct 08 '21 at 00:27
  • below seems to work, ty! – 0004 Oct 11 '21 at 23:04
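The `Resource` approach from the comment above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `iterate_prefix_keys` and its optional `s3` argument are hypothetical, added only so the example can run against a stand-in object. It assumes credentials are already configured for boto3.

```python
def iterate_prefix_keys(bucket, prefix, s3=None):
    """Yield the key of every object in `bucket` whose key starts with `prefix`.

    `s3` is a hypothetical optional argument: pass an existing boto3 S3
    resource (or a stand-in) to reuse it; otherwise one is created.
    """
    if s3 is None:
        import boto3  # imported lazily so a stand-in resource can be passed in
        s3 = boto3.resource("s3")
    # objects.filter() returns a lazy collection; further pages of results
    # are fetched behind the scenes as the loop advances.
    for obj in s3.Bucket(bucket).objects.filter(Prefix=prefix):
        yield obj.key
```

With the bucket from the question, `for key in iterate_prefix_keys('test-data-lake', 'test1/test2/'): print(key)` would print each key under that prefix.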

1 Answer


You can pass `Prefix` to the paginator so that only keys beginning with that prefix are returned:

import boto3


def iterate_bucket_items(bucket, prefix=''):
    """
    Generator that iterates over all objects in a given s3 bucket
    whose keys start with the given prefix

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
    for return data format
    :param bucket: name of s3 bucket
    :param prefix: key prefix to filter on, e.g. 'test1/test2/' (empty string matches every key)
    :return: dict of metadata for an object
    """


    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)

    for page in page_iterator:
        if page['KeyCount'] > 0:
            for item in page['Contents']:
                yield item


for i in iterate_bucket_items(bucket='my_bucket', prefix='test1/test2/'):
    print(i)
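If you start from a full S3 URI such as s3://test-data-lake/test1/test2/ rather than a separate bucket and prefix, one option is to split the URI first. The sketch below is only an illustration: `split_s3_uri`, `iterate_uri_items` and the optional `client` argument are hypothetical names, not part of boto3. It reads `Contents` with `.get()`, which has the same effect as the `KeyCount` check above when a page is empty.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # S3 keys have no leading slash, so strip it from the URI path.
    return parsed.netloc, parsed.path.lstrip("/")


def iterate_uri_items(uri, client=None):
    """Yield object metadata dicts for every key under the given S3 URI."""
    if client is None:
        import boto3  # imported lazily so split_s3_uri works without boto3
        client = boto3.client("s3")
    bucket, prefix = split_s3_uri(uri)
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page has no matching keys.
        for item in page.get("Contents", []):
            yield item
```

Calling `iterate_uri_items('s3://test-data-lake/test1/test2/')` would then yield one metadata dict per object under test1/test2/.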
Marcin