
I have an AWS S3 bucket with a Prefix (or "folder") called /photos. It "contains" a bunch of image files and a smaller number of EVENT.json files. A naive representation might look like this:

  • my-awesome-events-bucket
    • photos
      • image1.jpg
      • image2.jpg
      • 1_EVENT.json
      • image3.jpg
      • 2_EVENT.json
      • ...

Each EVENT.json file contains an object with path references to an arbitrary number of image files, grouping images into a specific event. Using the example above, image1.jpg and image2.jpg could appear in 1_EVENT.json, and image3.jpg may belong to 2_EVENT.json.
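
For illustration only, since the actual schema isn't shown here, a 1_EVENT.json might look something like this (the field names are assumptions):

{
    "event": "Some Event",
    "images": [
        "photos/image1.jpg",
        "photos/image2.jpg"
    ]
}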

As the bucket gets larger, I have an interest in paging through the results. I only want to request a page at a time from S3 as I need them. The problem I'm running into is that I want to page specifically by keys that contain the word "EVENT". I'm finding this difficult to accomplish without bringing back ALL the objects and then filtering or iterating the results.

Using an S3 Paginator, I'm able to get paging working. Assuming my PageSize and MaxItems are set to 6, this is what I might get back for my first page:

/photos/
/photos/image1.jpg
/photos/image2.jpg
/photos/1_EVENT.json
/photos/image3.jpg
/photos/2_EVENT.json

S3's flat structure means that it's paging through all objects in the bucket according to the Prefix, and limiting and paging according to the pagination parameters. This means that I could easily get multiple EVENT.json files, or none at all, depending on the page.
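
For reference, a first page like the one above can be produced with something along these lines (a minimal sketch, assuming the example bucket name from above):

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket='my-awesome-events-bucket',  # example bucket name from above
    Prefix='photos/',
    PaginationConfig={'MaxItems': 6, 'PageSize': 6})
for page in page_iterator:
    for obj in page.get('Contents', []):
        print(obj['Key'])  # images and EVENT.json files interleaved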

So I'm looking for something more along the lines of this:

/photos/1_EVENT.json
/photos/2_EVENT.json
/photos/3_EVENT.json
/photos/4_EVENT.json
/photos/5_EVENT.json
/photos/6_EVENT.json

without first having to request all objects and then slice the result set in some way, which is exactly what I'm doing currently:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket=app.config.get('S3_BUCKET'),  # bucket name from Flask-style app config
    Prefix="photos/")  # Left PaginationConfig MaxItems & PageSize off intentionally
# JMESPath expression: flatten all pages and yield each object whose Key contains "EVENT"
filtered_iterator = page_iterator.search(
    "Contents[?contains(Key, `EVENT`)][]")
for key_data in filtered_iterator:
    # key_data is a single object dict, e.g. {'Key': 'photos/1_EVENT.json', ...}
    pass

The above is really expensive, with no paging, but it does give me a list of all keys that contain my "EVENT" search string.

I specifically want to page results of only EVENT.json objects through S3 using boto3 without the overhead of returning and filtering all objects every request. Is that possible?

EDIT: I'm already narrowing requests down to just objects with the photos/ Prefix. This is because there are other "folders" in my bucket that may also contain EVENT files. That prevents me from using EVENT or EVENT.json as my Prefix, because the response may be polluted by files from other folders.

afilbert
  • If you just need a list of Amazon S3 content and you do not need it perfectly up-to-date, you could use [Amazon S3 Storage Inventory](http://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html) to store a daily CSV of all files in your S3 bucket. – John Rotenstein Dec 29 '16 at 23:16
  • @JohnRotenstein Storage Inventory doesn't appear to provide any additional structure that would assist in paging results, and is limited to cataloging by prefix according to the [documentation](http://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html#storage-inventory-how-to-set-up). If I could use it to create and maintain an inventory of just EVENT files with a given Prefix, however, the scheduled inventories might be worth the wait. – afilbert Dec 29 '16 at 23:47

2 Answers


The simplest way would be to rework your filename structure to have the EVENT files follow the pattern photos/EVENT_*.json instead of photos/*_EVENT.json. Then you could use the common prefix photos/EVENT_.
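
With keys renamed that way, prefix-based paging would do the filtering server-side. A minimal sketch, assuming the example bucket name from the question:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket='my-awesome-events-bucket',  # example bucket from the question
    Prefix='photos/EVENT_',             # only event files share this prefix
    PaginationConfig={'PageSize': 6})
for page in page_iterator:
    for obj in page.get('Contents', []):
        print(obj['Key'])  # e.g. photos/EVENT_1.json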

Short of that, I think that the expensive method you are using is actually the only way to go about it.

Kevin Seaman
  • Please use backticks to escape the filenames; `*something between here*` is rendered as italicized text. – Antti Haapala -- Слава Україні Dec 29 '16 at 20:58
  • The filenames are unfortunately being generated by an app that was developed before I came into the project. It's already in client circulation, and will be difficult to change. After looking into this for some time, I've come to the conclusion that I'll either need to change how we're naming and organizing the files, as you suggested, or settle for the overhead of bringing back all objects. I shouldn't be surprised, given "simple" is in the name of the S3 service. I'm alternatively looking into caching and paging the results through RDS. – afilbert Dec 30 '16 at 00:12

There is a prefix option you can pass to one of the listing functions in boto. This will dramatically reduce the number of files it has to scan. However, if you need to match a wildcard in the middle of the key, then as far as I know S3 still has to scan all the objects under the prefix, and you would have to do the wildcard filtering yourself.

ex:

bucket.search_function(prefix="string")  # "search_function" is a placeholder, not a real boto call

I can't recall the boto function off the top of my head though.
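
For what it's worth, in boto 2 the listing call that accepts a prefix is bucket.list. A minimal sketch, with the region and bucket name as assumptions:

import boto.s3

conn = boto.s3.connect_to_region('us-east-1')  # region is an assumption
bucket = conn.get_bucket('my-awesome-events-bucket')  # example bucket from the question
for key in bucket.list(prefix='photos/'):
    print(key.name)  # key names under the photos/ prefix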

Bob
  • I'm unfortunately already using the prefix to limit my results to the /photos "folder", else I'd use EVENT as the prefix and call it a day. Unfortunately, there are also EVENT files in other "folders" in that same bucket that I want to keep from polluting my /photos events. – afilbert Dec 29 '16 at 20:52