42

I am using boto and Python with Amazon S3.

If I use

[key.name for key in list(self.bucket.list())]

then I get all the keys of all the files.

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/

What is the best way to

1. either get all the folders from S3,
2. or, from that list, strip the file name from the end of each key and get the unique folder keys?

I am thinking of doing it like this:

set([re.sub("/[^/]*$", "/", path) for path in mylist])
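
For example, applied to the listing above (a rough, untested sketch; the key list is hard-coded here):

import re

# hypothetical list of key names, as produced by the list comprehension above
mylist = [
    "mybucket/files/pdf/abc.pdf",
    "mybucket/files/pdf/new/abc.pdf",
    "mybucket/files/pdf/2011/",
]

# strip the trailing file name (or trailing empty segment) from each key
folders = set([re.sub("/[^/]*$", "/", path) for path in mylist])
print(folders)  # mybucket/files/pdf/, mybucket/files/pdf/new/, mybucket/files/pdf/2011/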
user1958218

9 Answers

49

Building on sethwm's answer:

To get the top level directories:

list(bucket.list("", "/"))

To get the subdirectories of files/:

list(bucket.list("files/", "/"))

and so on.
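
Each result is a boto.s3.prefix.Prefix object (see the comments below); its name attribute holds the path. A quick sketch, with a made-up bucket name:

import boto

# made-up bucket name; any boto connection method works here
conn = boto.connect_s3()
bucket = conn.get_bucket('mybucket')

# with a delimiter, sub-folders come back as boto.s3.prefix.Prefix objects
for entry in bucket.list("files/", "/"):
    print(entry.name)  # e.g. files/pdf/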

j1m

  • That's great and the docs certainly led me in that direction, but I don't seem to get a list of keys. Instead I get a list with a key and a `boto.s3.prefix.Prefix()` object, which I don't really know what to do with. Any ideas? – brice Feb 09 '15 at 16:42
  • bucket.list does yield a list of prefix objects. The `name` attribute is probably what you're looking for. – Evan Muehlhausen Sep 23 '15 at 21:14
  • It's important to note that to get the directories, the `prefix` (first parameter) should end with the delimiter. – Ciprian Tomoiagă Feb 27 '17 at 15:53
22

This is going to be an incomplete answer since I don't know python or boto, but I want to comment on the underlying concept in the question.

One of the other posters was right: there is no concept of a directory in S3. There are only flat key/value pairs. Many applications pretend certain delimiters indicate directory entries. For example "/" or "\". Some apps go as far as putting a dummy file in place so that if the "directory" empties out, you can still see it in list results.

You don't always have to pull your entire bucket down and do the filtering locally. S3 has a concept of a delimited list where you specify what you deem your path delimiter ("/", "\", "|", "foobar", etc.) and S3 will return virtual results to you, similar to what you want.

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html (Look at the delimiter parameter.)

This API will get you one level of directories. So if you had in your example:

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/

If you passed in a LIST with prefix "" and delimiter "/", you'd get results:

mybucket/files/

If you passed in a LIST with prefix "mybucket/files/" and delimiter "/", you'd get results:

mybucket/files/pdf/

And if you passed in a LIST with prefix "mybucket/files/pdf/" and delimiter "/", you'd get results:

mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/2011/

You'd be on your own at that point if you wanted to eliminate the pdf files themselves from the result set.

Now how you do this in python/boto I have no idea, but hopefully there's a way to pass these parameters through.
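
For reference, boto's bucket.list() does take prefix and delimiter arguments that appear to map onto the LIST calls above; a minimal sketch, assuming configured credentials and a bucket named mybucket:

import boto

# assumed bucket name; credentials come from the environment or config
conn = boto.connect_s3()
bucket = conn.get_bucket('mybucket')

# equivalent to a LIST with prefix "files/pdf/" and delimiter "/"
for entry in bucket.list(prefix='files/pdf/', delimiter='/'):
    print(entry.name)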

perpetual_check
20

As pointed out in one of the comments, the approach suggested by j1m returns a Prefix object. If you are after a name/path, you can use its name attribute. For example:

import boto
import boto.s3

conn = boto.s3.connect_to_region('us-west-2')
bucket = conn.get_bucket('your-bucket-name')

folders = bucket.list("","/")
for folder in folders:
    print folder.name
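
If files sit at the same level as the folders, the listing mixes Key and Prefix objects; a type check separates them (a small sketch, continuing from the code above):

from boto.s3.prefix import Prefix

for entry in bucket.list("", "/"):
    # Prefix objects are the "folders"; everything else is an actual key
    if isinstance(entry, Prefix):
        print 'folder: ' + entry.name
    else:
        print 'file:   ' + entry.name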
Wawrzek
  • If you want to get all of your buckets, you can wrap the above in `buckets = conn.get_all_buckets()` and then `for bucket in buckets:`, continuing with the `bucket.list()` as before, e.g. `buckets = S3Connection().get_all_buckets()` then `for bucket in buckets: for folder in bucket.list(): print folder.name` – cgseller Jul 30 '15 at 21:27
19

I found the following to work using boto3:

import boto3
def list_folders(s3_client, bucket_name):
    # with a Delimiter, S3 returns folder-like groupings in 'CommonPrefixes'
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='', Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

s3_client = boto3.client('s3')
bucket_name = 'my-bucket-name'  # substitute your bucket name
folder_list = list_folders(s3_client, bucket_name)
for folder in folder_list:
    print('Folder found: %s' % folder)
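
Note that list_objects_v2 returns at most 1000 entries per response; for larger buckets a paginator follows the continuation tokens automatically (a sketch along the same lines):

import boto3

def list_folders_paginated(s3_client, bucket_name, prefix=''):
    # the paginator transparently issues follow-up requests for each page
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter='/'):
        for common_prefix in page.get('CommonPrefixes', []):
            yield common_prefix.get('Prefix')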

LucyDrops
11

Basically, there is no such thing as a folder in S3. Internally everything is stored as a key, and if the key name contains a slash character, clients may choose to display it as a folder.

With that in mind, you should first get all the keys and then use a regex to filter out the paths that include a slash. The solution you have right now is already a good start.
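
A sketch of that filtering step (pure string handling, no boto calls; keys is assumed to hold the full key list):

def folder_paths(keys):
    # every ancestor prefix of every key counts as a "folder":
    # a/b/c.pdf contributes both a/ and a/b/
    folders = set()
    for key in keys:
        parts = key.split('/')[:-1]  # drop the file name (or empty last segment)
        for depth in range(1, len(parts) + 1):
            folders.add('/'.join(parts[:depth]) + '/')
    return folders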

j0nes
7

I see you have successfully made the boto connection. If you only have one directory that you are interested in (like in the example you provided), I think what you can do is use the prefix and delimiter parameters that AWS already provides (Link).

Boto uses this feature in its bucket object, and you can retrieve hierarchical directory information using prefix and delimiter. bucket.list() will return a boto.s3.bucketlistresultset.BucketListResultSet object.

I tried this a couple of ways, and if you do choose to use a delimiter= argument in bucket.list(), the returned object is an iterator for boto.s3.prefix.Prefix, rather than boto.s3.key.Key. In other words, if you try to retrieve the subdirectories you should put delimiter='/', and as a result you will get an iterator for the Prefix objects.

Both returned objects (either prefix or key object) have a .name attribute, so if you want the directory/file information as a string, you can do so by printing like below:

from boto.s3.connection import S3Connection

key_id = '...'
secret_key = '...'

# Create connection
conn = S3Connection(key_id, secret_key)

# Get list of all buckets
allbuckets = conn.get_all_buckets()
for bucket_obj in allbuckets:
    print(bucket_obj.name)

# Connect to a specific bucket
bucket = conn.get_bucket('bucket_name')

# Get subdirectory info
for key in bucket.list(prefix='sub_directory/', delimiter='/'):
    print(key.name)
Erica Jh Lee

  • Whilst this code snippet is welcome, and may provide some help, it would be [greatly improved if it included an explanation](//meta.stackexchange.com/q/114762) of *how* and *why* this solves the problem. Remember that you are answering the question for readers in the future, not just the person asking now! Please [edit] your answer to add explanation, and give an indication of what limitations and assumptions apply. – Toby Speight Apr 06 '17 at 17:15
  • @TobySpeight, I added some additional information. Thank you for your comment. – Erica Jh Lee Apr 06 '17 at 19:21
3

The issue here, as has been said by others, is that a folder doesn't necessarily have a key, so you have to search through the strings for the / character and figure out your folders through that. Here's one way to generate a recursive dictionary imitating a folder structure.

If you want all the files and their URLs in the folders:

# build a nested dict mirroring the "folder" structure
assets = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')

    # walk (and create, if needed) a nested dict per path component
    identifier = assets
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except KeyError:
            identifier[uri] = {}
        identifier = identifier[uri]

    # keys that don't end in '/' are files; store their public URLs
    if not key.name.endswith('/'):
        identifier[path[-1]] = key.generate_url(expires_in=0, query_auth=False)

return assets

If you just want the folders themselves:

# build a nested dict of just the folders
folders = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')

    # walking path[1:-1] creates an entry for every folder level,
    # including explicit folder keys (names ending in '/', whose
    # final split element is an empty string and can be ignored)
    identifier = folders
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except KeyError:
            identifier[uri] = {}
        identifier = identifier[uri]

return folders

This can then be recursively read out later.
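
For example, a sketch of such a read-out (assuming the dictionary built above, where nested dicts are folders and string values are URLs):

def print_tree(node, indent=0):
    # nested dicts are sub-folders; anything else is a file entry
    for name, value in sorted(node.items()):
        if isinstance(value, dict):
            print(' ' * indent + name + '/')
            print_tree(value, indent + 2)
        else:
            print(' ' * indent + name + ' -> ' + str(value))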

Nathan Hazzard
0

The boto interface allows you to list the contents of a bucket while giving a prefix. That way you can get the entries for what would be a directory in a normal filesystem:

import boto

AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket('your-bucket-name')  # get_bucket() needs the bucket name

# note: S3 key names do not normally start with a '/'
bucket_entries = bucket.list(prefix='path/to/your/directory/')

for entry in bucket_entries:
    print entry
bambata
-1

Complete example with boto3, using the S3 client:

import boto3


def list_bucket_keys(bucket_name):
    s3_client = boto3.client("s3")
    """ :type : pyboto3.s3 """
    # Delimiter="/" makes S3 group the keys under the prefix into CommonPrefixes
    result = s3_client.list_objects(Bucket=bucket_name, Prefix="Trails/", Delimiter="/")
    return result['CommonPrefixes']


if __name__ == '__main__':
    print(list_bucket_keys("my-s3-bucket-name"))
joeButler