I have an object in S3 that looks like this:

{'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_useractivitylog_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()), 'ETag': '"60377db54e3bbcfe7d569b8ea029cfa3-1"', 'Size': 7, 'StorageClass': 'STANDARD'}

and the page from the page_iterator looks like this:

PAGE: {'ResponseMetadata': {'HTTPStatusCode': 200, 'HTTPHeaders': {}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_connectionlog_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()), 'ETag': '"60377db54e3bbcfe7d569b8ea029cfa3-1"', 'Size': 7, 'StorageClass': 'STANDARD'}, {'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_notvalidname_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()), 'ETag': '"60377db54e3bbcfe7d569b8ea029cfa3-1"', 'Size': 7, 'StorageClass': 'STANDARD'}, {'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_useractivitylog_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()), 'ETag': '"60377db54e3bbcfe7d569b8ea029cfa3-1"', 'Size': 7, 'StorageClass': 'STANDARD'}, {'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_userlog_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()),

and I'm trying to do the filter like this:

    page_iterator = paginator.paginate(**operation_parameters)
    print(f"FILTER: {filter}")
    # filtered_iterator = page_iterator.search(filter) if filter else page_iterator
    for page in page_iterator:
        print(f"PAGE: {page}")
        for obj in page.get("Contents", []):
            print(f"OBJECT: {obj}")
            yield obj

but I'm not getting objects back. Am I doing the JMESPath filter in search wrong? I'm going by these docs

and my filter is this:

"Contents[?Key[?contains(@, 'useractivitylog') == `true`]]"

What am I doing wrong?

β.εηοιτ.βε
Jwan622

2 Answers

Timing different AWS APIs and JMESPath implementations.

I used a folder and prefix containing around 1500 objects and timed retrieving all of them versus a filtered set. Perhaps surprisingly, the list_objects endpoint is much slower than the list_objects_v2 endpoint.

Using jmespath is only slightly faster than iterating through the pages with a Python list comprehension. Either way, all the data is pulled first and then filtered on the client. Maybe for a larger directory the difference would be more substantial.

%%timeit
keys_list = []
# s3sr is assumed to be an existing boto3 S3 resource, e.g. boto3.resource('s3')
paginator = s3sr.meta.client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
    # print(page)
    # bucket_object_paths = jmespath.search('Contents[*].Key', page)
    # search() returns None when a page has no 'Contents', hence the `or []`
    bucket_object_paths = jmespath.search("Contents[?contains(Key, 'straddles')].Key", page) or []
    keys_list.extend(bucket_object_paths)
len(keys_list)
# 450 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - 1460 objects
# 368 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - filtered

%%timeit
keys_list = []
paginator = s3sr.meta.client.get_paginator('list_objects_v2')
# use Delimiter to limit search to that level of hierarchy
for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
    # keys = [content['Key'] for content in page.get('Contents')]
    keys = [content['Key'] for content in page.get('Contents', []) if 'straddles' in content['Key']]
    # print('keys in page: ', len(keys))
    keys_list.extend(keys)
len(keys_list)
# 448 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - 1460 objects
# 398 ms ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - filtered

%%timeit
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/')
keys_list = page_iterator.search("Contents[?contains(Key, 'straddles')].Key")
# keys_list = page_iterator.search("Contents[*].Key")
len(list(keys_list))
# 948 ms ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - 1460 objects
# 885 ms ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - filtered
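
Since the filtering happens on the client in every variant above, the list-comprehension approach can be wrapped in a small generator that works on any iterable of pages. This is only a sketch: `iter_matching_keys` and `stub_pages` are hypothetical names, and the stub dicts stand in for real paginator output so the snippet runs without AWS access.

```python
from typing import Any, Dict, Iterable, Iterator


def iter_matching_keys(pages: Iterable[Dict[str, Any]], needle: str) -> Iterator[str]:
    """Yield every object key that contains `needle`, page by page."""
    for page in pages:
        # An empty listing has no 'Contents' entry, so default to an empty list.
        for obj in page.get("Contents", []):
            if needle in obj["Key"]:
                yield obj["Key"]


# Hypothetical stand-ins for the dicts a list_objects_v2 paginator would return.
stub_pages = [
    {"Contents": [{"Key": "logs/a_useractivitylog_1.gz"},
                  {"Key": "logs/a_connectionlog_1.gz"}]},
    {"KeyCount": 0},  # a page without 'Contents'
    {"Contents": [{"Key": "logs/b_useractivitylog_2.gz"}]},
]

matching_keys = list(iter_matching_keys(stub_pages, "useractivitylog"))
print(matching_keys)
```

With a real paginator, the first argument would be something like `paginator.paginate(Bucket=bucket, Prefix=prefix)`. Only `Prefix` narrows the listing on the server side; a substring match like this one still has to look at every key the listing returns.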
Jonathan Leon

The docs can be terribly confusing. Here's a good reference, but even that can be a bit verbose. https://opensourceconnections.com/blog/2015/07/27/advanced-aws-cli-jmespath-query/

There are three filters below; comment/uncomment the different lines to see how each one shapes the output.

Also, this traverses the entire bucket, so it can be time-consuming.

import boto3

bucket = 'new-bucket-for-lists'
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket=bucket)
# Each filter returns a different shape: whole objects, [Key, LastModified] pairs, or just LastModified
# filtered_iterator = page_iterator.search("Contents[?contains(Key, '.py')]")
# filtered_iterator = page_iterator.search("Contents[?contains(Key, '.py')][Key, LastModified]")
filtered_iterator = page_iterator.search("Contents[?contains(Key, '.py')].LastModified")
for key_data in filtered_iterator:
    print(key_data)
Jonathan Leon
  • thanks so much. Is this usage of `contains` even outlined in the actual jmespath docs??! – Jwan622 May 25 '21 at 20:56
    It is https://jmespath.org/specification.html#functions but unless one spends the time to really understand it, it's not very clear. But once I see it in action, it almost makes perfect sense. – Jonathan Leon May 26 '21 at 00:15
    out of curiosity, did you consider just doing the filtering using python/json and evaluating the key name? Wondering if this is truly faster as I have plenty of places I could use this. – Jonathan Leon May 26 '21 at 00:18
  • It's just that I didn't see usage of `contains` without the @ symbol and so your usage is different than what I expected from their docs. – Jwan622 May 26 '21 at 18:14
  • I did not, I've only used jmespath so far since boto3 supports it in the `search` feature so I figured it must be fast? – Jwan622 May 26 '21 at 18:15
  • I did some comparisons and added them in a new answer. – Jonathan Leon May 26 '21 at 20:06