s3 urls - get bucket name and path

Question

I have a variable that holds an AWS S3 URL:

s3://bucket_name/folder1/folder2/file1.json

I want to get the bucket_name into one variable and the rest, i.e. /folder1/folder2/file1.json, into another. I tried regular expressions and could get the bucket_name as shown below, but I'm not sure if there is a better way.

import re

m = re.search(r'(?<=s3://)[^/]+', 's3://bucket_name/folder1/folder2/file1.json')
print(m.group(0))

How do I get the rest, i.e. folder1/folder2/file1.json?

I checked whether boto3 has a feature to extract the bucket name and key from the URL, but couldn't find one.

kichik · Accepted Answer · 2019-05-17T17:51:33.877

Since it's just a normal URL, you can use urlparse to get all the parts of the URL.

>>> from urlparse import urlparse
>>> o = urlparse('s3://bucket_name/folder1/folder2/file1.json', allow_fragments=False)
>>> o
ParseResult(scheme='s3', netloc='bucket_name', path='/folder1/folder2/file1.json', params='', query='', fragment='')
>>> o.netloc
'bucket_name'
>>> o.path
'/folder1/folder2/file1.json'

You may have to remove the leading slash from the key, as another answer below suggests.

o.path.lstrip('/')

In Python 3, urlparse moved to urllib.parse, so use:

from urllib.parse import urlparse

Here's a class that takes care of all the details.

try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse


class S3Url(object):
    """
    >>> s = S3Url("s3://bucket/hello/world")
    >>> s.bucket
    'bucket'
    >>> s.key
    'hello/world'
    >>> s.url
    's3://bucket/hello/world'

    >>> s = S3Url("s3://bucket/hello/world?qwe1=3#ddd")
    >>> s.bucket
    'bucket'
    >>> s.key
    'hello/world?qwe1=3#ddd'
    >>> s.url
    's3://bucket/hello/world?qwe1=3#ddd'

    >>> s = S3Url("s3://bucket/hello/world#foo?bar=2")
    >>> s.key
    'hello/world#foo?bar=2'
    >>> s.url
    's3://bucket/hello/world#foo?bar=2'
    """

    def __init__(self, url):
        self._parsed = urlparse(url, allow_fragments=False)

    @property
    def bucket(self):
        return self._parsed.netloc

    @property
    def key(self):
        if self._parsed.query:
            return self._parsed.path.lstrip('/') + '?' + self._parsed.query
        else:
            return self._parsed.path.lstrip('/')

    @property
    def url(self):
        return self._parsed.geturl()

Watch out, though: if your filename includes a `#`, `o.path` won't contain the full key. `urlparse('s3://bucket_name/file #2.json').path == '/file '` – charlax, Feb 28 '19 at 07:46
@charlax Is there a solution which allows for arbitrary file names (e.g. including a `#`)? – Tom Hale, May 11 '19 at 09:31
You can use `allow_fragments=False` for that. See updated answer. If you want to support `?` too, you can check if `query` is set and add it to the final result. – kichik, May 11 '19 at 16:09
Umm... how about: `s3_filepath = "s3://bucket-name/some/key.txt"` `bucket, key = s3_filepath.replace("s3://", "").split("/", 1)` – Grant Langseth, Jul 14 '20 at 13:10
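
To make the `#` and `?` points from the comments above concrete, here is a small sketch using only `urllib.parse` from Python 3. The bucket and file names are made up for illustration.

```python
from urllib.parse import urlparse

url = "s3://bucket_name/file #2.json"

# Default parsing treats everything after '#' as a fragment, so the key is cut short.
default = urlparse(url)
# allow_fragments=False keeps '#' inside the path.
no_fragments = urlparse(url, allow_fragments=False)

print(repr(default.path))       # '/file '
print(repr(default.fragment))   # '2.json'
print(repr(no_fragments.path))  # '/file #2.json'

# A '?' is still split off as the query, so it has to be re-attached to rebuild the key.
q = urlparse("s3://bucket_name/report?v=1.json", allow_fragments=False)
key = q.path.lstrip("/") + ("?" + q.query if q.query else "")
print(key)                      # report?v=1.json
```

Note that a key ending in a bare `?` would still lose that character with this approach, since the parsed query is then empty.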

score 43 · Answer 2 · answered Jun 14 '18 at 19:02

A solution that works without urllib or re (also handles preceding slash):

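def split_s3_path(s3_path):
    # Drop the scheme, then split the remainder on every slash.
    path_parts = s3_path.replace("s3://", "").split("/")
    # The first part is the bucket; everything after it is the key.
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key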

To run:

bucket, key = split_s3_path("s3://my-bucket/some_folder/another_folder/my_file.txt")

Returns:

bucket: my-bucket
key: some_folder/another_folder/my_file.txt

mikeviescas

Probably better to use [`.partition("/")`](https://docs.python.org/3/library/stdtypes.html#str.partition) rather than `.split("/")` and `.join("/")`. `bucket, _, key = s3_path.replace("s3://","").partition("/")` Also, this assumes that `s3://` does not appear as a substring in the path itself. No sane person would do that, but maybe an attacker could exploit this? Not sure. – falsePockets Apr 11 '22 at 04:26
Instead of `replace`, you can use `[5:]`. Then even if `s3://` appears in the path again, it won't be an issue. `bucket, _, key = s3_uri[5:].partition("/")` – Vaibhav Vishal Jan 23 '23 at 18:33
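
Putting the two comments above together, a minimal sketch might look like the following. The function name `split_s3_uri` and the explicit scheme check are assumptions added for illustration; they are not part of the original answer.

```python
def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key) without touching the key itself."""
    prefix = "s3://"
    # Assumption: reject anything that does not start with the s3 scheme.
    if not s3_uri.startswith(prefix):
        raise ValueError("not an S3 URI: %r" % (s3_uri,))
    # Slice off only the leading prefix, then split at the first slash.
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    return bucket, key


print(split_s3_uri("s3://my-bucket/some_folder/my_file.txt"))
print(split_s3_uri("s3://my-bucket/odd/s3://name.txt"))
print(split_s3_uri("s3://my-bucket"))
```

Because `partition` always returns three parts, a bucket-only URI yields an empty key instead of raising, unlike unpacking the result of `split("/", 1)`.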

Mikhail Sirotenko · Answer 3 · 2018-08-22T23:07:41.670

35

For those who, like me, were trying to use urlparse to extract the key and bucket in order to create an object with boto3, there's one important detail: remove the slash from the beginning of the key.

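import boto3
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

o = urlparse('s3://bucket_name/folder1/folder2/file1.json')
bucket = o.netloc
key = o.path
client = boto3.client('s3')
# Strip the leading slash; otherwise the object is stored under a key starting with '/'.
client.put_object(Body='test', Bucket=bucket, Key=key.lstrip('/'))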

It took a while to realize that because boto3 doesn't throw any exception.

edited Aug 22 '18 at 23:07

answered Jan 13 '18 at 22:59

Thanks for the helpful answer, I think you are using `lstrip` twice though, once when assigning to `key` and again when passing `key` to the `put_object` method. It probably doesn't matter unless your keys have two consecutive slashes. That's probably possible with s3 object names. – Davos Mar 26 '18 at 16:47
key is o.path, it's not included in the original reply. – Ricardo Mayerhofer Aug 21 '18 at 21:38
Thanks @RicardoMayerhofer! Fixed. – Mikhail Sirotenko Aug 22 '18 at 23:08
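
On the point about consecutive slashes raised in the comments: `lstrip('/')` removes every leading slash, not just the one separating bucket and key. A minimal sketch of the difference, assuming a made-up URL whose key really does begin with a slash (`key_from_path` is a hypothetical helper name):

```python
from urllib.parse import urlparse


def key_from_path(path):
    # Drop exactly one leading slash (the separator after the bucket name).
    return path[1:] if path.startswith("/") else path


parsed = urlparse("s3://bucket_name//folder1/file1.json", allow_fragments=False)

print(parsed.path)                 # //folder1/file1.json
print(parsed.path.lstrip("/"))     # folder1/file1.json
print(key_from_path(parsed.path))  # /folder1/file1.json
```

For ordinary keys the two approaches give the same result; they only differ when the key itself starts with `/`.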

score 24 · Answer 4 · answered Jul 14 '20 at 13:13

Pretty easy to accomplish with a single line of builtin string methods...

s3_filepath = "s3://bucket-name/and/some/key.txt"
bucket, key = s3_filepath.replace("s3://", "").split("/", 1)

Grant Langseth

score 9 · Answer 5 · answered Nov 06 '17 at 05:36

If you want to do it with regular expressions, you can do the following:

>>> import re
>>> uri = 's3://my-bucket/my-folder/my-object.png'
>>> match = re.match(r's3:\/\/(.+?)\/(.+)', uri)
>>> match.group(1)
'my-bucket'
>>> match.group(2)
'my-folder/my-object.png'

This has the advantage that you can check for the s3 scheme rather than allowing anything there.
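
As a sketch of that scheme check in practice, the match can be wrapped in a function that raises when the URI does not fit. The name `parse_s3_uri` and the choice of `ValueError` are assumptions for illustration, and the pattern is slightly tightened to `[^/]+` for the bucket.

```python
import re

# Bucket is everything up to the first slash; the key is the rest.
_S3_URI = re.compile(r"s3://([^/]+)/(.+)")


def parse_s3_uri(uri):
    match = _S3_URI.fullmatch(uri)
    if match is None:
        # Wrong scheme, missing bucket, or missing key.
        raise ValueError("not a valid S3 object URI: %r" % (uri,))
    return match.group(1), match.group(2)


print(parse_s3_uri("s3://my-bucket/my-folder/my-object.png"))
```

Checking for `None` avoids the `AttributeError` that `match.group(1)` would otherwise raise on input such as `https://my-bucket/my-object.png`.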

score 8 · Answer 6 · answered Mar 07 '21 at 05:46

A more recent option is to use cloudpathlib, which implements pathlib functions for files on cloud services (including S3, Google Cloud Storage and Azure Blob Storage).

In addition to those functions, it's easy to get the bucket and the key for your S3 paths.

from cloudpathlib import S3Path

path = S3Path("s3://bucket_name/folder1/folder2/file1.json")

path.bucket
#> 'bucket_name'

path.key
#> 'folder1/folder2/file1.json'

hume

I think for working with cloud and s3, that this is the most practical way! Thanks for your solution – david backx Apr 26 '23 at 12:51

score 7 · Answer 7 · answered Apr 30 '20 at 19:37

This is a nice project:

s3path is a pathlib extension for the AWS S3 service

>>> from s3path import S3Path
>>> path = S3Path.from_uri('s3://bucket_name/folder1/folder2/file1.json')
>>> print(path.bucket)
'/bucket_name'
>>> print(path.key)
'folder1/folder2/file1.json'
>>> print(list(path.key.parents))
[S3Path('folder1/folder2'), S3Path('folder1'), S3Path('.')]

Lior Mizrahi

this works! and is indeed a nice project. – user3479780 Feb 20 '22 at 23:27

score 4 · Answer 8 · answered Nov 18 '21 at 06:27

This can be done smoothly:

bucket_name, key = s3_uri[5:].split('/', 1)

Justin J AR

score 3 · Answer 9 · answered Nov 01 '19 at 21:31

Here it is as a one-liner using regex:

import re

s3_path = "s3://bucket/path/to/key"

bucket, key = re.match(r"s3:\/\/(.+?)\/(.+)", s3_path).groups()

David

score 1 · Answer 10 · answered May 04 '21 at 10:07

The simplest approach I use is:

s = 's3://bucket/path1/path2/file.txt'
s1 = s.split('/', 3)
bucket = s1[2]
object_key = s1[3]

Czar

Welcome to StackOverflow @Czar! Your solution is incomplete in that it does not return both "path1" and "path2". – line-o May 04 '21 at 11:27
I'm not sure what you mean. I just tested the code, and I get `object_key='path1/path2/file.txt'`. This solution has the advantage of great simplicity! – joanis May 05 '21 at 13:35

Roland Ayala · Answer 11 · 2021-02-19T02:38:53.293

I use the following regex:

^(?:[sS]3:\/\/)?([a-zA-Z0-9\._-]+)(?:\/)(.+)$

If it matches, the S3 parts are captured as follows:

match group1 => S3 bucket name
match group2 => S3 object name

This pattern handles a bucket path with or without the s3:// URI prefix.

To allow other legal bucket name characters, modify the [a-zA-Z0-9\._-] part of the pattern to include them as needed.

Complete JS example (in TypeScript form)

const S3_URI_PATTERN = '^(?:[sS]3:\\/\\/)?([a-zA-Z0-9\\._-]+)(?:\\/)(.+)$';

export interface S3UriParseResult {
  bucket: string;
  name: string;
}

export class S3Helper {
  /**
   *
   * @param uri
   */
  static parseUri(uri: string): S3UriParseResult {
    const re = new RegExp(S3_URI_PATTERN);
    const match = re.exec(uri);
    if (!match || (match && match.length !== 3)) {
      throw new Error('Invalid S3 object URI');
    }
    return {
      bucket: match[1],
      name: match[2],
    };
  }
}
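
For readers following the rest of this thread in Python, a rough equivalent of the optional-prefix pattern might look like this. It is a sketch only: the function name `parse_s3_path` and the use of `ValueError` are assumptions, and the character class mirrors the one in this answer rather than the full S3 bucket naming rules.

```python
import re

# Optional s3:// or S3:// prefix, then bucket, a slash, and the object name.
_S3_PATH = re.compile(r"(?:[sS]3://)?([a-zA-Z0-9._-]+)/(.+)")


def parse_s3_path(path):
    match = _S3_PATH.fullmatch(path)
    if match is None:
        raise ValueError("Invalid S3 object URI")
    return {"bucket": match.group(1), "name": match.group(2)}


print(parse_s3_path("s3://my.bucket/folder/file.json"))
print(parse_s3_path("my-bucket/folder/file.json"))
```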
