28

I am using the code below to get a list of all file names from an S3 bucket. I have two buckets in S3. For one of the buckets the code returns all the file names (more than 1000), but the same code returns only 1000 file names for the other bucket. I just don't get what is happening. Why does the same code work for one bucket and not for the other?

Also, my bucket has a hierarchical structure: folder/filename.jpg.

ObjectListing objects = s3.listObjects("bucket.new.test");
do {
    for (S3ObjectSummary objectSummary : objects.getObjectSummaries()) {
        String key = objectSummary.getKey();
        System.out.println(key);
    }
    objects = s3.listNextBatchOfObjects(objects);
} while (objects.isTruncated());
Tobias Mayr
  • 196
  • 4
  • 16
Abhishek
  • 471
  • 1
  • 4
  • 12

10 Answers

22

Improving on @Abhishek's answer: this code is slightly shorter and the variable names are fixed.

You have to get the object listing, add its contents to the collection, then get the next batch of objects from the listing. Repeat until the listing is no longer truncated.

List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing objects = s3.listObjects("bucket.new.test");
keyList.addAll(objects.getObjectSummaries());

while (objects.isTruncated()) {
    objects = s3.listNextBatchOfObjects(objects);
    keyList.addAll(objects.getObjectSummaries());
}
DwB
  • 37,124
  • 11
  • 56
  • 82
oferei
  • 1,610
  • 2
  • 19
  • 27
  • But what is the root cause? Why did the same code work for one case and not for the other? – morsik Feb 27 '17 at 13:45
  • That's a good question, which I don't have the answer for. I only took @ Abhishek's code and "fixed" it. My only guess is that it's a property of the bucket. – oferei Feb 28 '17 at 14:04
  • 2
    I've got the same issue with "old" version of s3 java API. Amazon introduced "v2", which should resolve the issue: http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingJava.html Note, it uses `s3client.listObjectsV2` and `req.setContinuationToken(result.getNextContinuationToken())`. The last one should make separate underlying REST GET calls to s3 (as single get returns up to 1000 keys by default, http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html) – morsik Feb 28 '17 at 15:59
8

For Scala developers, here is a recursive function to execute a full scan and map of the contents of an AmazonS3 bucket, using the official AWS SDK for Java:

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{S3ObjectSummary, ObjectListing, GetObjectRequest}
import scala.collection.JavaConversions.{collectionAsScalaIterable => asScala}

def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T) = {

  def scan(acc:List[T], listing:ObjectListing): List[T] = {
    val summaries = asScala[S3ObjectSummary](listing.getObjectSummaries())
    val mapped = (for (summary <- summaries) yield f(summary)).toList

    if (!listing.isTruncated) mapped.toList
    else scan(acc ::: mapped, s3.listNextBatchOfObjects(listing))
  }

  scan(List(), s3.listObjects(bucket, prefix))
}

To invoke the above curried map() function, simply pass the already constructed (and properly initialized) AmazonS3Client object (refer to the official AWS SDK for Java API Reference), the bucket name and the prefix name in the first parameter list. Also pass the function f() you want to apply to map each object summary in the second parameter list.

For example

val keyOwnerTuples = map(s3, bucket, prefix)(s => (s.getKey, s.getOwner))

will return the full list of (key, owner) tuples in that bucket/prefix

or

map(s3, "bucket", "prefix")(s => println(s))

as you would normally do with monads in functional programming.

pangiole
  • 981
  • 9
  • 10
  • Paolo, thank you for an answer, I really liked your approach. Here's a question, do we need `acc` parameter here? It seems to me like the piece of code with recursion should be turned into something similar to `else mapped ::: scan(s3.listNextBatchOfObjects(listing))`. – GoodDok Dec 05 '18 at 15:14
  • 1
    I think there is a bug in your above code. Instead of `mapped.toList` it should be `acc ++ mapped.toList` to return not just the last set of S3 file keys but all of the keys. – van_d39 Sep 19 '19 at 02:37
6

I just changed the above code to use addAll instead of a for loop to add objects one by one, and it worked for me:

List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing object = s3.listObjects("bucket.new.test");
keyList = object.getObjectSummaries();
object = s3.listNextBatchOfObjects(object);

while (object.isTruncated()){
  keyList.addAll(object.getObjectSummaries());
  object = s3.listNextBatchOfObjects(object);
}
keyList.addAll(object.getObjectSummaries());

After that you can simply use any iterator over keyList.

Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190
Abhishek
  • 471
  • 1
  • 4
  • 12
  • I suggest using keyList.addAll(x) instead of assigning to keyList. This way you're not modifying a private member of ObjectListing (which was returned by getObjectSummaries) later using addAll. And, since you've already allocated a list in the first line, you're all set. – oferei Jan 13 '15 at 21:14
4

An alternative approach is to use a recursive method:

/**
 * A recursive method to wrap {@link AmazonS3} listObjectsV2 method.
 * <p>
 * By default, ListObjectsV2 can only return some or all (UP TO 1,000) of the objects in a bucket per request.
 * Ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
 * <p>
 * This method, however, returns every {@link S3ObjectSummary} for the request, across all pages.
 *
 * @param request the initial list request; its continuation token is updated between pages
 * @return all object summaries matching the request
 */
private List<S3ObjectSummary> getS3ObjectSummaries(final ListObjectsV2Request request) {
    final ListObjectsV2Result result = s3Client.listObjectsV2(request);
    final List<S3ObjectSummary> resultSummaries = result.getObjectSummaries();
    if (result.isTruncated() && isNotBlank(result.getNextContinuationToken())) {
        final ListObjectsV2Request nextRequest = request.withContinuationToken(result.getNextContinuationToken());
        final List<S3ObjectSummary> nextResultSummaries = this.getS3ObjectSummaries(nextRequest);
        resultSummaries.addAll(nextResultSummaries);
    }
    return resultSummaries;
}
General Grievance
  • 4,555
  • 31
  • 31
  • 45
leo
  • 454
  • 5
  • 10
3

If you want to get all objects (more than 1000 keys), you need to send another request to S3 with the marker set to the last key of the previous response. Here is the code.

private static String lastKey = "";
private static String preLastKey = "";
...

AmazonS3 s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());
String bucketName = "bucketname";

do {
    preLastKey = lastKey;

    ListObjectsRequest lstRQ = new ListObjectsRequest().withBucketName(bucketName).withPrefix("");

    lstRQ.setMarker(lastKey);

    ObjectListing objectListing = s3.listObjects(lstRQ);

    //  loop over the files in this batch
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        //   get the object and do something...
        lastKey = objectSummary.getKey();  // remember the last key for the next marker
    }
} while (!lastKey.equals(preLastKey));  // compare strings with equals(), not !=
Sy Loc
  • 64
  • 5
2

In Scala:

val first = s3.listObjects("bucket.new.test")

val listings: Seq[ObjectListing] = Iterator.iterate(Option(first))(_.flatMap(listing =>
  if (listing.isTruncated) Some(s3.listNextBatchOfObjects(listing))
  else None
))
  .takeWhile(_.nonEmpty)
  .toList
  .flatten
Ori Popowski
  • 10,432
  • 15
  • 57
  • 79
1
  1. Paolo Angioletti's code can't get all the data, only the last batch of data.
  2. I think it might be better to use ListBuffer.
  3. This method does not support setting startAfterKey.
    import com.amazonaws.services.s3.AmazonS3Client
    import com.amazonaws.services.s3.model.{ObjectListing, S3ObjectSummary}    
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ListBuffer

    def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T): List[T] = {

      def scan(acc: ListBuffer[T], listing: ObjectListing): List[T] = {
        val r = acc ++= listing.getObjectSummaries.asScala.map(f).toList
        if (listing.isTruncated) scan(r, s3.listNextBatchOfObjects(listing))
        else r.toList
      }

      scan(ListBuffer.empty[T], s3.listObjects(bucket, prefix))
    }

The second approach is to use AWS SDK v2:

<dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>s3</artifactId>
    <version>2.1.0</version>
</dependency>
  import software.amazon.awssdk.services.s3.S3Client
  import software.amazon.awssdk.services.s3.model.{ListObjectsV2Request, S3Object}

  import scala.collection.JavaConverters._

  def listObjects[T](s3: S3Client, bucket: String,
                     prefix: String, startAfter: String)(f: (S3Object) => T): List[T] = {
    val request = ListObjectsV2Request.builder()
      .bucket(bucket).prefix(prefix)
      .startAfter(startAfter).build()

    s3.listObjectsV2Paginator(request)
      .asScala
      .flatMap(_.contents().asScala)
      .map(f)
      .toList
  }
iDuanYingJie
  • 663
  • 4
  • 11
0

By default the API returns up to 1,000 key names. The response might contain fewer keys but will never contain more. A better implementation is to use the newer ListObjectsV2 API:

List<S3ObjectSummary> docList = new ArrayList<>();
ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucketName).withPrefix(folderFullPath);
ListObjectsV2Result listing;
do {
    listing = this.getAmazonS3Client().listObjectsV2(req);
    docList.addAll(listing.getObjectSummaries());
    String token = listing.getNextContinuationToken();
    req.setContinuationToken(token);
    LOG.info("Next Continuation Token for listing documents is: " + token);
} while (listing.isTruncated());
atul jha
  • 11
  • 2
0

The code given by @oferei works well and I upvoted it. But I want to point out the root issue with @Abhishek's code. Actually, the problem is with the do-while loop.

If you observe carefully, you fetch the next batch of objects in the second-to-last statement, and only then check whether you have exhausted the total list of files. So when you fetch the last batch, isTruncated() becomes false and you break out of the loop without processing the last total % 1000 records. For example: if you had 2123 records in total, you would end up fetching 1000 and then 1000, i.e. 2000 records. You would miss the 123 remaining records, because the isTruncated check breaks the loop before the batch fetched last is ever processed.

Apologies, I can't post a comment, else I would have commented on the upvoted answer.
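The effect described above can be reproduced without touching S3 at all. The sketch below is a minimal, self-contained simulation (FakeListing, list(), and the 2123-key / 1000-per-page numbers are all made up for illustration, not part of the AWS SDK): it runs the buggy fetch-then-check loop next to the corrected process-then-fetch loop and counts how many keys each one actually sees.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an S3 object listing: one page of keys plus a truncation flag.
class FakeListing {
    final List<String> keys;    // keys in this batch
    final boolean truncated;    // do more batches remain?
    final int nextStart;        // where the next page begins
    FakeListing(List<String> keys, boolean truncated, int nextStart) {
        this.keys = keys; this.truncated = truncated; this.nextStart = nextStart;
    }
}

public class PaginationBug {
    static final int TOTAL = 2123, PAGE = 1000;

    // Simulates listObjects / listNextBatchOfObjects: returns up to PAGE keys per call.
    static FakeListing list(int start) {
        List<String> page = new ArrayList<>();
        int end = Math.min(start + PAGE, TOTAL);
        for (int i = start; i < end; i++) page.add("key-" + i);
        return new FakeListing(page, end < TOTAL, end);
    }

    // Buggy order: fetch the next batch, THEN test isTruncated -> the final batch is dropped.
    static int buggyCount() {
        int n = 0;
        FakeListing objects = list(0);
        do {
            n += objects.keys.size();
            objects = list(objects.nextStart);
        } while (objects.truncated);
        return n;
    }

    // Fixed order: process the current batch first, and fetch another only if truncated.
    static int fixedCount() {
        int n = 0;
        FakeListing objects = list(0);
        while (true) {
            n += objects.keys.size();
            if (!objects.truncated) break;  // last batch already processed
            objects = list(objects.nextStart);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(buggyCount());  // 2000: the trailing 123 keys are lost
        System.out.println(fixedCount());  // 2123: every key is seen
    }
}
```

The same reordering applies directly to the real SDK loop: move the `isTruncated()` check before the `listNextBatchOfObjects` call.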

GeekyGags
  • 59
  • 5
0

The reason you are getting only the first 1000 objects is that this is how listObjects is designed to work.

This is from its JavaDoc:

Returns some or all (up to 1,000) of the objects in a bucket with each request. 
You can use the request parameters as selection criteria to return a subset of the objects in a bucket. 
A 200 OK response can contain valid or invalid XML. Make sure to design your application to parse the contents of the response and handle it appropriately. 
Objects are returned sorted in an ascending order of the respective key names in the list. For more information about listing objects, see Listing object keys programmatically 

To get paginated results automatically, use the listObjectsV2Paginator method:

ListObjectsV2Request listReq = ListObjectsV2Request.builder()
        .bucket(bucketName)
        .maxKeys(1)
        .build();

ListObjectsV2Iterable listRes = s3.listObjectsV2Paginator(listReq);

// Helper method to work with the paginated collection of items directly
listRes.contents().stream()
        .forEach(content -> System.out.println("Key: " + content.key() + " size = " + content.size()));

You can opt for manual pagination as well if needed.

Reference: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/pagination.html

Sanjay Bharwani
  • 3,317
  • 34
  • 31