
I know there are a lot of similar questions (especially this one) asked on SO, but none of the answers actually solves my situation. And of course I know there is no such thing as a folder in S3; internally everything is stored as a key.

I have the following directory structure:

TWEAKS/date=2020-03-19/hour=20/file.gzip
TWEAKS/date=2020-03-20/hour=21/file.gzip
TWEAKS/date=2020-03-21/hour=22/file.gzip
TWEAKS/date=2020-03-22/hour=23/file.gzip

I tried this:

def list_folders(s3_client, bucket_name):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='TWEAKS/', Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

s3_client = session.client('s3')
folder_list = list_folders(s3_client, bucket_name)
for folder in folder_list:
    print('Folder found: %s' % folder)

But this only lists directories up to the first level:

Folder found: TWEAKS/date=2020-03-19/
Folder found: TWEAKS/date=2020-03-20/
Folder found: TWEAKS/date=2020-03-21/
Folder found: TWEAKS/date=2020-03-22/

Now I cannot add the subdirectory to the Prefix because the names are not the same (hour=21, hour=22, ...). Is there a way to achieve this output?

Folder found: TWEAKS/date=2020-03-19/hour=20/
Folder found: TWEAKS/date=2020-03-20/hour=21/
Folder found: TWEAKS/date=2020-03-21/hour=22/
Folder found: TWEAKS/date=2020-03-22/hour=23/
Anum Sheraz
  • You would need to recursively look through every `CommonPrefix`, passing the CommonPrefix as the new `Prefix`, then use the new list of CommonPrefixes. Frankly, it would be easier just to list all objects and then parse the strings, since it requires the fewest API calls. If your bucket is HUGE, then you could consider using [Amazon S3 Inventory](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html) to obtain a daily CSV file of the bucket's contents. – John Rotenstein Apr 03 '20 at 04:43
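
A rough sketch of the recursive approach described in that comment (my own illustration, not code from the thread; list_leaf_prefixes is a hypothetical helper name):

def list_leaf_prefixes(s3_client, bucket_name, prefix=''):
    # List one level with Delimiter='/' and recurse into each CommonPrefix
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter='/')

    common_prefixes = []
    for page in pages:
        common_prefixes.extend(p['Prefix'] for p in page.get('CommonPrefixes', []))

    if not common_prefixes:
        # No deeper "folders" under this prefix, so it is a leaf
        yield prefix
        return

    for child in common_prefixes:
        yield from list_leaf_prefixes(s3_client, bucket_name, child)

Each level costs at least one API call per prefix, which is why the comment suggests simply listing all objects instead.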

2 Answers


I think you'll need to actually enumerate all of the objects, and then infer the unique folder names, something like this:

import os
import boto3

def list_folders(s3_client, bucket_name):
    folders = set()
    # Note: a single list_objects_v2 call returns at most 1,000 keys;
    # see the paginated variant below if you have more objects.
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='TWEAKS/')

    # Infer the "folder" for each object from the directory part of its key
    for content in response.get('Contents', []):
        folders.add(os.path.dirname(content['Key']))

    return sorted(folders)

s3 = boto3.client("s3")
folder_list = list_folders(s3, 'mybucket')

for folder in folder_list:
    print('Folder found: %s' % folder)

Output is:

Folder found: TWEAKS/date=2020-03-19/hour=20
Folder found: TWEAKS/date=2020-03-20/hour=21
Folder found: TWEAKS/date=2020-03-21/hour=22
Folder found: TWEAKS/date=2020-03-22/hour=23
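
If there may be more than 1,000 objects under the prefix, the same idea works with boto3's paginator. A paginated variant of the function above (my addition, not part of the original answer):

def list_folders_paginated(s3_client, bucket_name, prefix='TWEAKS/'):
    # Page through every object under the prefix and infer folder names
    folders = set()
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for content in page.get('Contents', []):
            folders.add(os.path.dirname(content['Key']))
    return sorted(folders)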
jarmod
  • So boto3 doesn't have that option like the aws cli does? I actually need to avoid reading all of the files inside the directories, which takes a lot of time. – Anum Sheraz Apr 02 '20 at 14:26
  • I don't see how you could do this without enumerating the objects. Generally, there are no folders in S3; it's all an inference from the keys of the objects. Hence you have to list them all to make that inference. – jarmod Apr 02 '20 at 14:44
  • my bad, the aws cli command `aws s3 ls s3:// --recursive` also shows file names. – Anum Sheraz Apr 02 '20 at 14:47

I stumbled upon this question while trying to implement an ls for listing s3 objects and "sub-directories" immediately below a given path. (Note that there are no "folders" in S3, only key-value pairs.)

While not exactly an answer, it is relevant, and I felt I should share it since it builds upon jarmod's answer.

import boto3

S3_CLIENT = boto3.client('s3')

def ls(bucket_and_path):
    # Split "bucket/some/prefix" into the bucket name and the key prefix
    parts = bucket_and_path.split('/')
    bucket, prefix = parts[0], '/'.join(parts[1:])

    if prefix and not prefix.endswith('/'):
        prefix += '/'

    # Retrieve results in batches (a single list call truncates at 1,000 keys)
    paginator = S3_CLIENT.get_paginator('list_objects')
    page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)

    # Get immediate child "folders" and/or files of prefix
    children_of_prefix = set()
    for response in page_iterator:
        for content in response.get('Contents', []):
            full_path_to_object = content['Key']
            # Keep only the first path component after the prefix
            relative_path_after_prefix = full_path_to_object[len(prefix):]
            child_of_prefix = relative_path_after_prefix.split('/')[0]
            children_of_prefix.add(child_of_prefix)

    return sorted(children_of_prefix)

Usage:

>>> ls('my-bucket')
['dir_1', 'dir_2', 'somefile.txt']
>>> ls('my-bucket/dir_1')
['another_file.txt']
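
As a side note (my own addition, not from the thread): if you only need the immediate children and want to avoid enumerating every object, the same ls can be built on Delimiter='/', letting S3 group the sub-prefixes server-side. ls_delimited below is a hypothetical variant of the function above:

def ls_delimited(bucket_and_path):
    parts = bucket_and_path.split('/')
    bucket, prefix = parts[0], '/'.join(parts[1:])
    if prefix and not prefix.endswith('/'):
        prefix += '/'

    children = set()
    paginator = S3_CLIENT.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        # CommonPrefixes are the "sub-directories", Contents the files
        for cp in page.get('CommonPrefixes', []):
            children.add(cp['Prefix'][len(prefix):].rstrip('/'))
        for obj in page.get('Contents', []):
            children.add(obj['Key'][len(prefix):])
    return sorted(children)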
Wassadamo