
This Stack Overflow answer helped a lot. However, I want to search for all PDFs inside a given bucket.

  1. I click "None".
  2. I start typing.
  3. I type *.pdf.
  4. I press Enter.

Nothing happens. Is there a way to use wildcards or regular expressions to filter bucket search results via the online S3 GUI console?

– nu everest

8 Answers


As stated in a comment, Amazon's UI can only be used to search by prefix as per their own documentation:

http://docs.aws.amazon.com/AmazonS3/latest/UG/searching-for-objects-by-prefix.html

There are other methods of searching, but they require a bit of effort. To name two options: the AWS CLI, or Boto3 for Python. A sketch of the Boto3 route follows below.
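
For example, a minimal Boto3 sketch; the bucket name and prefix are placeholders, not anything from this question:

import boto3

s3 = boto3.client('s3')

# Hypothetical bucket and prefix -- substitute your own.
# The Prefix filter is applied server-side; the extension check is client-side.
# Note: a single list_objects_v2 call returns at most 1,000 keys.
resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='some/prefix/')
for obj in resp.get('Contents', []):
    if obj['Key'].endswith('.pdf'):
        print(obj['Key'])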

I know this post is old, but it is high in Google's results for S3 searching and does not have an accepted answer. The other answer by Harish links to a dead site.

UPDATE 2020/03/03: The AWS link above has been removed. This link to a very similar topic was as close as I could find: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html

– Michael Hohlios

  • Let it be noted that this documentation didn't exist at the time the question was asked. – nu everest Jan 11 '17 at 14:58
  • Also let it be noted that this documentation no longer exists, and redirects to the documentation home. – Mr Griever Nov 16 '17 at 20:00
  • Also let it be noted that *not* allowing richer searches and *only* sorting items on the current console page makes things impossible to find in the S3 console. (Definitely send AWS feedback from the console.) – davemyron Jan 22 '20 at 19:32

AWS CLI search: in the AWS Console you can only search within a single directory, not across the whole bucket, and only by the prefix of the file name (an S3 search limitation).

The best way is to use the AWS CLI with the command below on Linux (the cut -c 32- strips the date, time, and size columns, leaving just the key):

aws s3 ls s3://bucket_name/ --recursive | grep search_word | cut -c 32- 

Searching for files by extension (note that grep takes a regular expression, not a shell wildcard):

aws s3 ls s3://bucket_name/ --recursive | grep '\.pdf$'
– Tech Support
  • Can you explain how this will help me find all PDFs? – nu everest Aug 10 '17 at 23:50
  • aws s3 ls s3://bucket_name/ --recursive |grep *.pdf – Tech Support Aug 11 '17 at 00:21
  • I had to use a period: '.*.pdf' - see https://stackoverflow.com/a/1069333/12383690 – Momo Jan 10 '22 at 19:47
  • If you're going to search for multiple files/patterns and the bucket prefix has many objects, it can save a lot of time to redirect the output to a `.txt` file. This way you only download once and can `grep` repeatedly without network latency. – Addison Klinke Sep 08 '22 at 15:05
  • There's a risk this will result in a lot of AWS calls. If you have a million objects in your bucket it will take 1000 calls to complete, as it pages in blocks of 1000. It's performing the filter client side, where prefix searches happen server side. – Philip Couling Feb 10 '23 at 20:01
  • @Momo extending what you said a bit more, the correct syntax for searching for file extensions would be `| grep '.*\.pdf$'`. The first period means "any character", the asterisk means 0-inf times, \ is to escape the next period so it literally looks for that (rather than any single character), and the $ at the end means that no characters after this point are accepted. A shorter version is to leave off the initial .* entirely (`| grep '\.pdf$'`), as grep matches the pattern anywhere on the line even without the .* prefix. – mpag Jun 28 '23 at 17:46

You can use the copy command (aws s3 cp) with the --dryrun flag:

aws s3 cp s3://your-bucket/any-prefix/ ./ --recursive --exclude "*" --include "*.pdf" --dryrun

This lists all of the PDF files without actually copying anything.

– user11002455

If you use boto3 in Python, it's quite easy to find the files. Replace 'bucket' with the name of your bucket.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')

# Walk every object in the bucket and print the keys that end in .pdf.
for obj in bucket.objects.all():
    if obj.key.endswith('.pdf'):
        print(obj.key)
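
bucket.objects.all() pages through the entire bucket, 1,000 keys per underlying API call. If the PDFs live under a known key prefix, a variant of the same sketch (the prefix here is a placeholder) narrows the listing server-side:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')

# filter(Prefix=...) is applied server-side, so only matching keys are paged in.
# 'reports/' is a hypothetical prefix -- substitute your own.
for obj in bucket.objects.filter(Prefix='reports/'):
    if obj.key.endswith('.pdf'):
        print(obj.key)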
– Matts

The CLI can do this; aws s3 only supports prefixes, but aws s3api supports arbitrary filtering. For S3 URLs that look like s3://company-bucket/category/obj-foo.pdf, s3://company-bucket/category/obj-bar.pdf, and s3://company-bucket/category/baz.pdf, you can run

aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?ends_with(Key, '.pdf')]"

or for a more general wildcard

aws s3api list-objects --bucket "company-bucket" --prefix "category/" --query "Contents[?contains(Key, 'foo')]"

or even

aws s3api list-objects --bucket "company-bucket" --prefix "category/obj" --query "Contents[?ends_with(Key, '.pdf') && contains(Key, 'ba')]"

The full query language is described in the JMESPath specification.
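
If you prefer to stay in Python, boto3's paginators accept the same JMESPath expressions via their search() method. A minimal sketch, using the same placeholder bucket and prefix as above:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='company-bucket', Prefix='category/')

# search() evaluates the JMESPath expression across every page of results;
# pages with no Contents yield None, hence the guard.
for obj in pages.search("Contents[?ends_with(Key, '.pdf')]"):
    if obj is not None:
        print(obj['Key'])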

– jkmartin
  • there is a small typo in the last example: It should be `ends_with` as per https://jmespath.org/specification.html#ends-with – Sebastian J. Jul 11 '22 at 21:41

The documentation for the Java SDK suggests it can be done:

https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingJava.html

Specifically, the listObjectsV2 call accepts a request with a prefix filter, e.g. "files/2020-01-02", so you can return only results matching today's date (note that prefixes are literal strings, not wildcards, so no trailing * is needed).

https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/ListObjectsV2Result.html

– Philluminati

My guess is that the files were uploaded from a Unix system and you're downloading to Windows, so s3cmd is unable to preserve file permissions, which don't apply on NTFS.

To search for files and grab them, try this from the target directory (or change ./ to the target):

for i in `s3cmd ls s3://bucket | grep "searchterm" | awk '{print $4}'`; do s3cmd sync --no-preserve $i ./; done

This works in WSL on Windows.


I have used this in one of my projects, but it's a bit hard-coded:

import subprocess

bucket = "Abcd"
# List sub_dir/ and keep the lines containing ".csv" (the dot is escaped for grep).
command = "aws s3 ls s3://" + bucket + "/sub_dir/ | grep '\\.csv'"
listofitems = subprocess.check_output(command, shell=True)
listofitems = listofitems.decode('utf-8')
# Each `aws s3 ls` line ends with the key name; keep the last field of each line.
print([item.split(" ")[-1] for item in listofitems.split("\n")[:-1]])
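
Since shell=True with string concatenation is fragile (quoting, injection), here is a variant of the same idea that skips the shell and does the filtering in Python; bucket and path are the same placeholders as above:

import subprocess

bucket = "Abcd"
# Run the CLI directly, without a shell; an argument list avoids quoting problems.
out = subprocess.run(
    ["aws", "s3", "ls", "s3://" + bucket + "/sub_dir/"],
    check=True, capture_output=True, text=True,
).stdout

# Each `aws s3 ls` line ends with the key name; keep the .csv ones.
print([line.split()[-1] for line in out.splitlines() if line.endswith(".csv")])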
– Deepak Tripathi