
This Stack Overflow answer helped a lot. However, I want to search for all PDFs inside a given bucket.

  1. I click "None".
  2. I start typing.
  3. I type *.pdf.
  4. I press Enter.

Nothing happens. Is there a way to use wildcards or regular expressions to filter bucket search results via the online S3 GUI console?

– nu everest

8 Answers


As stated in a comment, Amazon's UI can only be used to search by prefix as per their own documentation:

http://docs.aws.amazon.com/AmazonS3/latest/UG/searching-for-objects-by-prefix.html

There are other methods of searching, but they require a bit of effort. To name two options: the AWS CLI, or Boto3 for Python. A sketch of the Boto3 route follows below.
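
For example, a minimal Boto3 sketch; the bucket name and prefix are placeholders, not anything from this question:

import boto3

s3 = boto3.client('s3')

# Hypothetical bucket and prefix -- substitute your own.
# The Prefix filter is applied server-side; the extension check is client-side.
# Note: a single list_objects_v2 call returns at most 1,000 keys.
resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='some/prefix/')
for obj in resp.get('Contents', []):
    if obj['Key'].endswith('.pdf'):
        print(obj['Key'])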

I know this post is old, but it is high in Google's results for S3 searching and does not have an accepted answer. The other answer by Harish links to a dead site.

UPDATE 2020/03/03: The AWS link above has been removed. This link to a very similar topic was as close as I could find: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html

– Michael Hohlios

  • Let it be noted that this documentation didn't exist at the time the question was asked. – nu everest Jan 11 '17 at 14:58
  • Also let it be noted that this documentation no longer exists, and redirects to the documentation home. – Mr Griever Nov 16 '17 at 20:00
  • Also let it be noted that *not* allowing richer searches and *only* sorting items on the current console page makes things impossible to find in the S3 console. (Definitely send AWS feedback from the console.) – davemyron Jan 22 '20 at 19:32

AWS CLI search: in the AWS Console you can only search within a single directory, not across the whole bucket, and only by the prefix of the file name (an S3 search limitation).

The best way is to use the AWS CLI with the command below on Linux (the cut -c 32- strips the date, time, and size columns, leaving just the key):

aws s3 ls s3://bucket_name/ --recursive | grep search_word | cut -c 32- 

Searching for files by extension (note that grep takes a regular expression, not a shell wildcard):

aws s3 ls s3://bucket_name/ --recursive | grep '\.pdf$'
– Tech Support
  • Can you explain how this will help me find all PDFs? – nu everest Aug 10 '17 at 23:50
  • aws s3 ls s3://bucket_name/ --recursive |grep *.pdf – Tech Support Aug 11 '17 at 00:21
  • I had to use a period: '.*.pdf' - see https://stackoverflow.com/a/1069333/12383690 – Momo Jan 10 '22 at 19:47
  • If you're going to search for multiple files/patterns and the bucket prefix has many objects, it can save a lot of time to redirect the output to a `.txt` file. This way you only download once and can `grep` repeatedly without network latency. – Addison Klinke Sep 08 '22 at 15:05
  • There's a risk this will result in a lot of AWS calls. If you have a million objects in your bucket it will take 1000 calls to complete, as it pages in blocks of 1000. It's performing the filter client side, where prefix searches happen server side. – Philip Couling Feb 10 '23 at 20:01
  • @Momo extending what you said a bit more, the correct syntax for searching for file extensions would be `| grep '.*\.pdf$'`. The first period means "any character", the asterisk means 0-inf times, \ is to escape the next period so it literally looks for that (rather than any single character), and the $ at the end means that no characters after this point are accepted. A shorter version is to leave off the initial .* entirely (`| grep '\.pdf$'`), as grep matches the pattern anywhere on the line even without the .* prefix. – mpag Jun 28 '23 at 17:46

You can use the copy command (aws s3 cp) with the --dryrun flag:

aws s3 cp s3://your-bucket/any-prefix/ ./ --recursive --exclude "*" --include "*.pdf" --dryrun

This lists all of the PDF files without actually copying anything.

– user11002455

If you use boto3 in Python, it's quite easy to find the files. Replace 'bucket' with the name of your bucket.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')

# Walk every object in the bucket and print the keys that end in .pdf.
for obj in bucket.objects.all():
    if obj.key.endswith('.pdf'):
        print(obj.key)
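
bucket.objects.all() pages through the entire bucket, 1,000 keys per underlying API call. If the PDFs live under a known key prefix, a variant of the same sketch (the prefix here is a placeholder) narrows the listing server-side:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')

# filter(Prefix=...) is applied server-side, so only matching keys are paged in.
# 'reports/' is a hypothetical prefix -- substitute your own.
for obj in bucket.objects.filter(Prefix='reports/'):
    if obj.key.endswith('.pdf'):
        print(obj.key)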
– Matts

The CLI can do this; aws s3 only supports prefixes, but aws s3api supports arbitrary filtering. For S3 URLs that look like s3://company-bucket/category/obj-foo.pdf, s3://company-bucket/category/obj-bar.pdf, and s3://company-bucket/category/baz.pdf, you can run

aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?ends_with(Key, '.pdf')]"

or for a more general wildcard

aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?contains(Key, 'foo')]"

or even

aws s3api list-objects --bucket "company-bucket" --prefix "category/obj" --query "Contents[?ends_with(Key, '.pdf') && contains(Key, 'ba')]"

The full query language is described in the JMESPath specification.
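
If you prefer to stay in Python, boto3's paginators accept the same JMESPath expressions via their search() method. A minimal sketch, using the same placeholder bucket and prefix as above:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='company-bucket', Prefix='category/')

# search() evaluates the JMESPath expression across every page of results;
# pages with no Contents yield None, hence the guard.
for obj in pages.search("Contents[?ends_with(Key, '.pdf')]"):
    if obj is not None:
        print(obj['Key'])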

– jkmartin
  • there is a small typo in the last example: It should be `ends_with` as per https://jmespath.org/specification.html#ends-with – Sebastian J. Jul 11 '22 at 21:41

The documentation for the Java SDK suggests it can be done:

https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingJava.html

Specifically, the listObjectsV2 call accepts a request with a prefix filter, e.g. "files/2020-01-02", so you can return only results matching today's date (note that prefixes are literal strings, not wildcards, so no trailing * is needed).

https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/ListObjectsV2Result.html

– Philluminati

My guess is that the files were uploaded from a Unix system and you're downloading to Windows, so s3cmd is unable to preserve file permissions, which don't apply on NTFS.

To search for files and grab them, try this from the target directory (or change ./ to the target):

for i in `s3cmd ls s3://bucket | grep "searchterm" | awk '{print $4}'`; do s3cmd sync --no-preserve $i ./; done

This works in WSL on Windows.


I have used this in one of my projects, but it's a bit hard-coded:

import subprocess

bucket = "Abcd"
# List sub_dir/ and keep the lines containing ".csv" (the dot is escaped for grep).
command = "aws s3 ls s3://" + bucket + "/sub_dir/ | grep '\\.csv'"
listofitems = subprocess.check_output(command, shell=True)
listofitems = listofitems.decode('utf-8')
# Each `aws s3 ls` line ends with the key name; keep the last field of each line.
print([item.split(" ")[-1] for item in listofitems.split("\n")[:-1]])
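
Since shell=True with string concatenation is fragile (quoting, injection), here is a variant of the same idea that skips the shell and does the filtering in Python; bucket and path are the same placeholders as above:

import subprocess

bucket = "Abcd"
# Run the CLI directly, without a shell; an argument list avoids quoting problems.
out = subprocess.run(
    ["aws", "s3", "ls", "s3://" + bucket + "/sub_dir/"],
    check=True, capture_output=True, text=True,
).stdout

# Each `aws s3 ls` line ends with the key name; keep the .csv ones.
print([line.split()[-1] for line in out.splitlines() if line.endswith(".csv")])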
– Deepak Tripathi