108

I have been trying to find a better command line tool for duplicating buckets than s3cmd. s3cmd can duplicate buckets without having to download and upload each file. The command I normally run to duplicate buckets using s3cmd is:

s3cmd cp -r --acl-public s3://bucket1 s3://bucket2

This works, but it is very slow as it copies each file via the API one at a time. If s3cmd could run in parallel mode, I'd be very happy.

Are there other command line tools or code that people use to duplicate buckets faster than s3cmd?

Edit: Looks like s3cmd-modification is exactly what I'm looking for. Too bad it does not work. Are there any other options?

Sean McCleary (edited by Keet Sugathadasa)
  • 6
    Not sure why this question is repeatedly being closed, since it seems a number of developers have run into it. Anyway, I solved it in highly parallel fashion, here's the link: https://github.com/cobbzilla/s3s3mirror thanks! - jonathan. – cobbzilla May 17 '13 at 00:51

9 Answers

181

AWS CLI seems to do the job perfectly, and has the bonus of being an officially supported tool.

aws s3 sync s3://mybucket s3://backup-mybucket

http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

Supports concurrent transfers by default. See http://docs.aws.amazon.com/cli/latest/topic/s3-config.html#max-concurrent-requests

To quickly transfer a huge number of small files, run the command from an EC2 instance to decrease latency, and increase max_concurrent_requests to reduce the impact of that latency. For example:

aws configure set default.s3.max_concurrent_requests 200
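
The same setting can also live in ~/.aws/config rather than being set on the command line. A minimal sketch of the stanza the command above writes (200 is just the example value from above):

[default]
s3 =
  max_concurrent_requests = 200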
pythonjsgeo (edited by Keet Sugathadasa)
  • How fast is it? Does it support concurrent sync-ing? – Phương Nguyễn May 12 '14 at 10:13
  • 5
    It supports non-concurrent sync based on file modified time, size, etc. It was blazing fast when I tried it. I believe the objects are copied directly on S3 without downloading them to the local machine. It doesn't run in parallel by default but I'm sure you could have multiple sync commands running on separate subfolders at the same time. It's fast enough that you probably won't need it running in parallel anyway. I just duplicated 100GB of data in a few minutes. – pythonjsgeo May 13 '14 at 01:19
  • 13
    Slow as hell if the number of files is high. – Phương Nguyễn Jul 09 '14 at 14:37
  • 14
    When transferring many small files latency becomes the key constraint so running this command from an EC2 instance is essential. – pythonjsgeo Aug 27 '14 at 00:23
  • 1
    I used this to build a Docker image and it works pretty well: https://github.com/sunshineo/s3-bucket-copier – Gordon Sun Jul 14 '15 at 04:10
  • This command is also much faster when both buckets are in the same region. – ironmouse Jul 07 '16 at 10:35
  • 3
    Now it DOES support concurrent syncing :-) http://docs.aws.amazon.com/cli/latest/topic/s3-config.html#max-concurrent-requests – pythonjsgeo Sep 19 '16 at 23:28
  • 1
    Works, but it was quite slow indeed (about 300GB/hour in my case). Maybe because I didn't change `default.s3.max_concurrent_requests` to 200. Anyway, I had less than 1.5TB, so I'm glad I tried it without optimizing; that way I'll be able to compare next time. – Balmipour Apr 05 '17 at 12:54
  • `aws configure set default.s3.max_queue_size 5000` can also have an impact on the speed, and if you've got the processing power available, higher values for both of these settings can make it go even faster. See https://stackoverflow.com/a/40270349/622140 for 700MB/s+ transfers (that's 2500GB/hr). – James Jan 05 '18 at 13:30
  • 1
    Have a look at this blog... this might help https://neelbhatt.com/2017/05/26/best-tools-for-s3-amazon-web-serviceaws/ – Keet Sugathadasa Jan 15 '19 at 15:50
  • Protip - if you have a lot of files to move, don't waste your time with free-tier EC2 boxes. I just spun up some GIANT EC2 boxes (96 processors!!) to do the `aws sync`, and the whole thing was massively faster. I'll kill the boxes in a few more minutes and I suspect I'll have paid less than $1 to get the job done in 10% of the time. – Eric Jun 08 '20 at 11:07
75

If you don't mind using the AWS console, you can:

  1. Select all of the files/folders in the first bucket
  2. Click Actions > Copy
  3. Create a new bucket and select it
  4. Click Actions > Paste

It's still fairly slow, but you can leave it alone and let it do its thing.

deadwards (edited by matt burns)
  • Is this copying the contents of the source bucket to my machine as it copies to the destination? There's a lot of network activity and browser inspector is extremely slow so it's hard to analyse. 600K/s out on my machine. This would then be a lot faster initiating the transfer within the amazon network... Gonna try there instead. – Brad G Sep 14 '13 at 14:43
  • 1
    @BradGoss that's a good question. I assumed it was using the Amazon network to copy the contents of the bucket and wasn't doing any transferring to your local machine, but if you find out differently, let me know. – deadwards Sep 14 '13 at 21:03
  • 9
    I just used this method today. It does not pull the files down to your local machine – it does a direct copy and is much, much faster. – Greg Benedict Feb 03 '14 at 13:40
  • 7
    It still fetches the file list. If the list is too long (tens of thousands of files in my case) then it's slow as heck, and a timeout/hang is extremely likely. – Phương Nguyễn Jul 09 '14 at 14:36
  • Can you log out of the console and keep this transfer happening in the background? Or will logging out cancel the transfer? – haz0rd Oct 16 '14 at 22:41
  • 15
    Unfortunately this process is tied to the browser. :( From [the docs](http://docs.aws.amazon.com/AmazonS3/latest/UG/MakingaCopyofanObject.html): "After you initiate copy process you must keep the browser open while the copy is in-progress." – David Lemayian Dec 31 '14 at 11:55
  • @haz0rd Please see my last comment on this. Hope it helps. – David Lemayian Dec 31 '14 at 19:47
  • "this process is tied to the browser" ... gaaaah!, wish I read that one earlier :) I'm not recommending you running this 10 minutes before you want to go home – equivalent8 Mar 23 '15 at 16:59
  • 6
    I'm trying to do this on a bucket with 8 million files in it. Don't know how many months it's gonna take me to select all the checkboxes... – Chris Harrison Jun 19 '15 at 02:55
  • This is a very slow process, and it is peculiar that it requires my browser to stay open for the process to continue. Glancing at the network utilization, I believe this method is almost certainly downloading the 30 MB worth of files to my browser and then uploading them back to Amazon. I recommend using the CLI from a shell sitting on Amazon's network. – Ninjaxor Nov 03 '15 at 20:43
29

I have tried cloning two buckets using the AWS web console, s3cmd, and the AWS CLI. Although these methods work most of the time, they are painfully slow.

Then I found s3s3mirror: a specialized tool for syncing two S3 buckets. It's multi-threaded and a lot faster than the other approaches I have tried. I quickly moved gigabytes of data from one AWS region to another.

Check it out at https://github.com/cobbzilla/s3s3mirror, or download a Docker container from https://registry.hub.docker.com/u/pmoust/s3s3mirror/
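
A rough sketch of a typical run, assuming the wrapper script described in the project's README (check the README for the build steps and the exact options, e.g. the thread-count flag praised in the comments below):

git clone https://github.com/cobbzilla/s3s3mirror.git
cd s3s3mirror
./s3s3mirror.sh source-bucket destination-bucket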

Ketil (edited by Keet Sugathadasa)
  • 1
    If you have a lot of files to transfer, this is by far the best tool for the job. Shame it's so far down the list of answers... – John Chrysostom Jan 25 '16 at 13:25
  • Note to some people: Requires Java 6/7 to compile. – notbrain Apr 25 '16 at 23:48
  • 1
    I'm using this from an EC2 instance and it works unbelievably fast! I had to replace the source and destination placeholders with the actual bucket names (not the endpoint or anything like in the AWS CLI). – ironmouse Jun 07 '17 at 09:25
  • 1
    Amazing tool, highly recommended over others, for large number of files. Control over number of copy threads is brilliant. – Shaunak Dec 30 '18 at 16:45
  • Don't you think it is safer to use aws-cli and not third party applications to do the job? After all we need to provide credentials or access keys to use these tools. – Keet Sugathadasa Jan 15 '19 at 15:27
11

For an ad hoc solution, use the AWS CLI to sync between buckets:

aws s3 sync speed depends on:
- latency for an API call to the S3 endpoint
- the number of API calls made concurrently

To increase sync speed:
- run aws s3 sync from an EC2 instance (c3.large on FreeBSD is OK ;-) )
- update ~/.aws/config with:
-- max_concurrent_requests = 128
-- max_queue_size = 8096

With the above config and instance type, I was able to sync a bucket (309 GB, 72K files, us-east-1) within 474 seconds.
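
If you would rather not edit ~/.aws/config by hand, the same two values can be set from the command line (numbers taken from the list above; tune them for your instance):

aws configure set default.s3.max_concurrent_requests 128
aws configure set default.s3.max_queue_size 8096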

For a more generic solution, consider AWS Data Pipeline or S3 cross-region replication.

Tom Lime
  • Do you know if I could expect the same performance if I used S3 cp? Are you sure that when you used sync it actually transferred all 309GB? sync will only sync files that are not the same or not present in the other bucket. – frosty Jul 06 '17 at 22:44
  • Also, what are your thoughts on this for a use case where I have 1k or less files, but they are larger in size (10gb)? Do you think I would see similar performance to you? – frosty Jul 06 '17 at 23:15
  • @frosty, in my case the destination bucket was empty. Per the `awscli` doc, `aws sync` copies only new and updated files. You should probably expect high performance with `aws cp` too (copying is done internally; your client just issues API calls). Performance depends on these factors: 1. latency between src and dst regions (e.g. us-east-X to us-west-X) 2. latency between your client and the AWS API endpoint (how fast you can issue an API call) 3. the number of concurrent requests (how many requests per second your client can issue). In my case 309G was copied between buckets in the same region (us-east-1) – Tom Lime Jul 22 '17 at 19:07
3

Extending deadwards' answer: in 2021, copying objects from one bucket to another in the AWS console takes no more than 2 minutes for 1.2 GB of data.

  1. Create the destination bucket: enter the bucket name, choose the region, and copy settings from the existing bucket if desired. Create the bucket.
  2. Once the bucket is created, go to the source bucket you want to copy the files from.
  3. Select all (or choose just the desired files and folders), then Actions > Copy.
  4. For the destination, browse to the bucket the files and folders should be copied to.
  5. Once you click the Copy button, all the files and folders are copied within a minute or two.
Boopathi D (edited by Dharman)
3

As this is Google's first hit on this subject, I'm adding some extra information.

'Cyno' made a newer version of s3cmd-modification, which now supports parallel bucket-to-bucket syncing. Exactly what I was waiting for as well.

The pull request is at https://github.com/pcorliss/s3cmd-modification/pull/2, and his version is at https://github.com/pearltrees/s3cmd-modification
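
I have not verified the exact option names, but going by the fork's description of parallel workers, an invocation might look roughly like the following (treat --parallel and --workers as placeholders and check the fork's --help output):

s3cmd cp -r --acl-public --parallel --workers 10 s3://bucket1 s3://bucket2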

2

I don't know of any other S3 command line tools but if nothing comes up here, it might be easiest to write your own.

Pick whatever language and Amazon SDK/Toolkit you prefer. Then you just need to list/retrieve the source bucket contents and copy each file (in parallel, obviously).

Looking at the source for s3cmd-modification (and I admit I know nothing about Python), it looks like they have not parallelised the bucket-to-bucket code, but perhaps you could use the standard upload/download parallel code as a starting point.
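
Not the author's code, just a sketch of the general idea using the AWS CLI (which post-dates this answer): list the keys in the source bucket and issue server-side copy requests in parallel with xargs. Keys containing spaces or quotes would need more careful handling:

aws s3api list-objects-v2 --bucket bucket1 --query 'Contents[].Key' --output text \
  | tr '\t' '\n' \
  | xargs -P 20 -I {} aws s3 cp "s3://bucket1/{}" "s3://bucket2/{}"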

Geoff Appleford
  • Yeah. I have been toying with this idea, writing it in Ruby in an evented manner with EventMachine, or threaded with JRuby. However, s3cmd is already pretty complete and I'd rather just use that. I have been talking with the developer of s3cmd, and he has a couple of solutions in the pipeline that will likely address the performance problems. – Sean McCleary Jan 12 '11 at 23:35
1

A simple `aws s3 cp s3://[original-bucket] s3://[backup-bucket] --recursive` works well (assuming you have the AWS CLI set up).
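
If you also need the copies to be publicly readable (the question used s3cmd's --acl-public), aws s3 cp accepts canned ACLs as well, e.g.:

aws s3 cp s3://[original-bucket] s3://[backup-bucket] --recursive --acl public-read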

mdmjsh
1

If you have AWS console access, use AWS CloudShell and run the command below:

aws s3 sync s3://mybucket s3://backup-mybucket

There is no need to install the AWS CLI or any other tools.

The command is taken from the top answer above. CloudShell will make sure your command keeps running smoothly even if you lose your connection, and it's faster too, since the transfer is straight AWS-to-AWS with no local machine in between.

Guga Nemsitsveridze