
I have files in an S3 bucket, and I am trying to download files based on a date, like 8th Aug, 9th Aug, etc.

I used the following code, but it still downloads the entire bucket:

aws s3 cp s3://bucketname/ folder/file \
    --profile pname \
    --exclude "*" \
    --recursive \
    --include "2015-08-09*"

I am not sure how to achieve this. How can I download only the files for a particular date?

user4033385
    Could you please clarify your question? Is the date encoded in your filename, or do you wish to base the download on the creation date of the object? – John Rotenstein Aug 15 '15 at 08:18
  • @JohnRotenstein Filenames are '2015-08-15T001Z.log.gz', '2015-08-14T001Z.log.gz', etc. For each day there will be 24 files, ranging from 'T001z.log.gz' to 'T023z.log.gz'. I am looking for a way to pick the files of a particular day, e.g. 2015-08-15, and copy all 24 files for that date from S3 to the local machine. – user4033385 Aug 16 '15 at 09:40

5 Answers


This command will copy all files starting with 2015-08-15:

aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2015-08-15*" --recursive

If your goal is to synchronize a set of files without copying them twice, use the sync command:

aws s3 sync s3://BUCKET/ folder

That will copy all files that have been added or modified since the previous sync.

In fact, this is the equivalent of the above cp command:

aws s3 sync s3://BUCKET/ folder --exclude "*" --include "2015-08-15*"
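If you are not sure what a filter will match, both `cp` and `sync` accept a `--dryrun` flag that prints the operations without performing them. A minimal sketch, reusing the example bucket and date from above:

aws s3 cp s3://BUCKET/ folder --exclude "*" --include "2015-08-15*" --recursive --dryrun

Once the listed operations look right, drop `--dryrun` to actually copy the files.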


John Rotenstein
  • It takes hours to download logs for the last day when a folder contains logs for several years, because `aws s3 sync` iterates over all of them and compares each against the `exclude`/`include` filters. Is there any way to do it faster? – Alexey Nov 27 '18 at 02:19

A Bash command to copy all files for a specific date or month to the current folder:

aws s3 ls s3://bucketname/ | grep '2021-02' | awk '{print $4}' | xargs -I {} aws s3 cp s3://bucketname/{} folder

The command does the following:

  1. Listing all the files under the bucket
  2. Filtering the files for 2021-02, i.e. all files from February 2021
  3. Extracting only their names (the fourth column of the `ls` output)
  4. Running `aws s3 cp` on each of those files via `xargs` (an equivalent loop form is sketched below)
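As a hedged alternative using the same hypothetical bucket name, the last step can also be written as an explicit shell loop instead of piping into `xargs`:

for key in $(aws s3 ls s3://bucketname/ | grep '2021-02' | awk '{print $4}'); do
    aws s3 cp "s3://bucketname/${key}" folder
done

This issues one `aws s3 cp` call per object, so it is no faster, but it makes it obvious how each key is substituted into the copy command.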
Anand Tripathi
  • Till step 3 it works, but I am getting `fatal error: An error occurred (404) when calling the HeadObject operation: Key` in the 4th step. Do I need to pass the file names manually instead of {}? – Chakreshwar Sharma Mar 10 '21 at 11:21
  • This is good, but the last bit needs to be changed to `xargs -I {} aws s3 cp s3://bucketname/{} folder` in order to work. The shell doesn't magically replace `{}` with whatever was passed from `stdin`, you need xargs. – DBedrenko Nov 25 '22 at 11:40

If your bucket is large, upwards of 10 to 20 GB (as it was in my own use case), you can achieve the same goal by running sync in multiple terminal windows.

All the terminal sessions can use the same token, in case you need to generate a token for a production environment.

$ aws s3 sync s3://bucket-name/sub-name/another-name folder-name-in-pwd/ \
    --exclude "*" --include "name_date1*" --profile UR_AC_SomeName

and in another terminal window (same pwd):

$ aws s3 sync s3://bucket-name/sub-name/another-name folder-name-in-pwd/ \
    --exclude "*" --include "name_date2*" --profile UR_AC_SomeName

and another two for "name_date3*" and "name_date4*"
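A hedged sketch of the same idea in a single terminal, with the same placeholder bucket, path, and profile names: run the syncs as background jobs and wait for all of them to finish.

$ aws s3 sync s3://bucket-name/sub-name/another-name folder-name-in-pwd/ \
    --exclude "*" --include "name_date1*" --profile UR_AC_SomeName &
$ aws s3 sync s3://bucket-name/sub-name/another-name folder-name-in-pwd/ \
    --exclude "*" --include "name_date2*" --profile UR_AC_SomeName &
$ wait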

Additionally, you can also do multiple excludes in the same sync command as in:

$ aws s3 sync s3://bucket-name/sub-name/another-name my-local-path/ \
    --exclude="*.log/*" --exclude=img --exclude=".error" --exclude=tmp \
    --exclude="*.cache"
Nirmal

This Bash script copies all files for a given modified date from one bucket to another using the AWS CLI.

# List all objects, filter by the modified-date column, and keep only the object key
# (cut -b 32- drops the date, time and size columns from each line)
aws s3 ls <BCKT_NAME> --recursive | sort | grep "2020-08-*" | cut -b 32- > a.txt

Inside the Bash file:

# Copy each listed key from the source bucket to the destination bucket,
# using server-side encryption (AES256)
while IFS= read -r line; do
    aws s3 cp "s3://<SRC_BCKT>/${line}" "s3://<DEST_BCKT>/${line}" --sse AES256
done < a.txt
Mano Sriram

The AWS CLI is really slow at this. I waited hours and nothing really happened, so I looked for alternatives.

https://github.com/peak/s5cmd worked great.

It supports globs, for example:

s5cmd -numworkers 30 cp 's3://logs-bucket/2022-03-30-19-*' .

It is blazing fast, so you can work with buckets that hold S3 access logs without much fuss.
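If you want to check what a glob matches before copying, s5cmd's `ls` command also accepts wildcards. A hedged sketch, reusing the example bucket above:

s5cmd ls 's3://logs-bucket/2022-03-30-19-*'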

Sankalp