22

I have a large number of files (>1,000) stored in an S3 bucket, and I would like to iterate over them (e.g. in a for loop) to extract data from them using boto3.

However, according to http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects, the list_objects() method of the Client class returns at most 1,000 objects per call:

In [1]: import boto3

In [2]: client = boto3.client('s3')

In [11]: apks = client.list_objects(Bucket='iper-apks')

In [16]: type(apks['Contents'])
Out[16]: list

In [17]: len(apks['Contents'])
Out[17]: 1000

However, I would like to list all the objects, even if there are more than 1,000. How could I achieve this?

Kurt Peek

3 Answers

33

As kurt-peek notes, boto3 has a Paginator class, which allows you to iterate over pages of S3 objects and can easily be wrapped to provide an iterator over the items within those pages:

import boto3


def iterate_bucket_items(bucket):
    """
    Generator that iterates over all objects in a given s3 bucket

    See http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2 
    for return data format
    :param bucket: name of s3 bucket
    :return: dict of metadata for an object
    """

    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        if page['KeyCount'] > 0:
            for item in page['Contents']:
                yield item


for i in iterate_bucket_items(bucket='my_bucket'):
    print(i)

Which will output something like:

{u'ETag': '"a8a9ee11bd4766273ab4b54a0e97c589"',
 u'Key': '2017-06-01-10-17-57-EBDC490AD194E7BF',
 u'LastModified': datetime.datetime(2017, 6, 1, 10, 17, 58, tzinfo=tzutc()),
 u'Size': 242,
 u'StorageClass': 'STANDARD'}
{u'ETag': '"03be0b66e34cbc4c037729691cd5efab"',
 u'Key': '2017-06-01-10-28-58-732EB022229AACF7',
 u'LastModified': datetime.datetime(2017, 6, 1, 10, 28, 59, tzinfo=tzutc()),
 u'Size': 238,
 u'StorageClass': 'STANDARD'}
...

Note that list_objects_v2 is recommended instead of list_objects: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

You can also do this at a lower level by calling list_objects_v2() directly and passing the NextContinuationToken value from each response as ContinuationToken in the next call, for as long as IsTruncated is true in the response.
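The lower-level approach might look roughly like the sketch below. The function name iterate_bucket_items_manually is made up for illustration, and the client argument is assumed to be a boto3 S3 client such as boto3.client('s3'); only list_objects_v2 and the response keys Contents, IsTruncated and NextContinuationToken come from the S3 API.

```python
def iterate_bucket_items_manually(client, bucket):
    """Yield the metadata dict of every object in `bucket`.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    The function name is hypothetical and not part of boto3.
    """
    kwargs = {'Bucket': bucket}
    while True:
        response = client.list_objects_v2(**kwargs)
        # 'Contents' is absent from the response when a page has no objects.
        for item in response.get('Contents', []):
            yield item
        # IsTruncated is true while further pages remain.
        if not response.get('IsTruncated'):
            break
        # Resume the listing where the previous response stopped.
        kwargs['ContinuationToken'] = response['NextContinuationToken']
```

It is used the same way as the paginator-based generator, for example by looping over iterate_bucket_items_manually(boto3.client('s3'), 'my_bucket'). In most cases the paginator is the simpler choice, since it does this token handling for you.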

NpnSaddy
John Carter
3

I found out that boto3 has a Paginator class to deal with truncated results. The following worked for me:

paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='iper-apks')

after which I can iterate over page_iterator in a for loop. Each page is a response dict whose 'Contents' list holds up to 1,000 objects.
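As a minimal sketch of that loop, the helper below flattens the pages into a stream of object keys. The name iter_keys is hypothetical, and page_iterator is assumed to be the object returned by paginator.paginate(...) above.

```python
def iter_keys(page_iterator):
    """Yield the key of every object across all pages.

    `page_iterator` is assumed to come from paginator.paginate(Bucket=...).
    The function name is hypothetical and not part of boto3.
    """
    for page in page_iterator:
        # A page for an empty listing carries no 'Contents' entry.
        for item in page.get('Contents', []):
            yield item['Key']
```

Looping over iter_keys(page_iterator) then visits every key in the bucket, not just the first 1,000.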

Kurt Peek
  • The documentation suggests using "list_objects_v2" for new development; see https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html. This is also supported by the paginator; see http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Paginator.ListObjectsV2 – John Carter May 29 '17 at 09:18
-1
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsRequest
import java.util._

import scala.collection.JavaConverters._

val s3client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build()
val listObjectsRequest = new ListObjectsRequest().withBucketName("<enter_bucket_name>").withPrefix("<enter_path>").withDelimiter("/")
val bucketListing = s3client.listObjects(listObjectsRequest).getCommonPrefixes.asScala

println("")

for (file <- bucketListing) {
    println(file)
}

println("")
Rajiv Singh
  • Thank you for this code snippet, which might provide some limited, immediate help. A [proper explanation](https://meta.stackexchange.com/q/114762/9193372) would greatly improve its long-term value by showing why this is a good solution to the problem and would make it more useful to future readers with other, similar questions. Please edit your answer to add some explanation, including the assumptions you’ve made. – Syscall Mar 24 '21 at 13:32
  • Same as [this post](https://stackoverflow.com/a/66782105/9193372) – Syscall Mar 24 '21 at 13:33