103

I'd like to copy some files from a production bucket to a development bucket daily.

For example: Copy productionbucket/feed/feedname/date to developmentbucket/feed/feedname/date

Because the files I want are so deep in the folder structure, it's too time-consuming to go into each folder and copy/paste.

I've played around with mounting drives to each bucket and writing a Windows batch script, but that is very slow, and it unnecessarily downloads all the files/folders to the local server and uploads them back up again.

Matt Dell

15 Answers

122

Update

As pointed out by alberge (+1), nowadays the excellent AWS Command Line Interface provides the most versatile approach for interacting with (almost) all things AWS - by now it covers most services' APIs and also features higher-level S3 commands for dealing with your use case specifically; see the AWS CLI reference for S3:

  • sync - Syncs directories and S3 prefixes. Your use case is covered by Example 2 (more fine-grained usage with --exclude, --include, prefix handling etc. is also available):

    The following sync command syncs objects under a specified prefix and bucket to objects under another specified prefix and bucket by copying s3 objects. [...]

    aws s3 sync s3://from_my_bucket s3://to_my_other_bucket
    

For completeness, I'll mention that the lower-level S3 commands are also still available via the s3api sub command, which would allow you to directly translate any SDK-based solution to the AWS CLI before eventually adopting its higher-level functionality.


Initial Answer

Moving files between S3 buckets can be achieved by means of the PUT Object - Copy API (followed by DELETE Object):

This implementation of the PUT operation creates a copy of an object that is already stored in Amazon S3. A PUT copy operation is the same as performing a GET and then a PUT. Adding the request header, x-amz-copy-source, makes the PUT operation copy the source object into the destination bucket. Source

There are respective samples available for all existing AWS SDKs; see Copying Objects in a Single Operation. Naturally, a scripting-based solution would be the obvious first choice here, so Copy an Object Using the AWS SDK for Ruby might be a good starting point; if you prefer Python instead, the same can of course be achieved via boto as well, see method copy_key() within boto's S3 API documentation.

PUT Object only copies files, so you'll still need to explicitly delete a file via DELETE Object after a successful copy operation if you actually want to move it, but that will be just another few lines once the overall script handling the bucket and file names is in place (there are respective examples as well, see e.g. Deleting One Object Per Request).
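For illustration, here is a minimal sketch of that copy-then-delete pattern using the current boto3 client (not part of the original answer; the bucket names and key are placeholders, and the single-operation copy API handles objects up to 5 GB, beyond which a multipart copy is needed):

# Hedged sketch: copy-then-delete with boto3, which wraps the
# PUT Object - Copy and DELETE Object APIs described above.
import boto3

s3 = boto3.client("s3")
key = "feed/feedname/2012-01-01/file.csv"  # hypothetical key

# PUT Object - Copy: server-side copy, nothing is downloaded to the client
s3.copy_object(
    Bucket="developmentbucket",
    Key=key,
    CopySource={"Bucket": "productionbucket", "Key": key},
)

# DELETE Object: only needed if you want a move rather than a copy
s3.delete_object(Bucket="productionbucket", Key=key)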

Stephan Schielke
Steffen Opel
69

The new official AWS CLI natively supports most of the functionality of s3cmd. I'd previously been using s3cmd or the Ruby AWS SDK to do things like this, but the official CLI works great for it.

http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

aws s3 sync s3://oldbucket s3://newbucket
A B
  • This should be upvoted to the top of the list. It's the proper way to sync buckets and the most up to date of all these answers. – dft Apr 17 '14 at 20:42
  • If you have trouble with 403 access denied errors, see this blog post. It helped. http://www.alfielapeter.com/posts/8-transferring-s3-bucket-contents-between-accounts-with-s3cmd – crlane Jul 02 '14 at 18:12
  • Cross-region copy: `aws s3 sync s3://my-bucket-in-eu-west1 s3://my-bucket-in-eu-central1 --source-region=eu-west-1 --region=eu-central-1` – equivalent8 Dec 16 '15 at 09:30
  • If you need to run this overnight on a server, use `nohup aws s3 sync s3://my-bucket-in-eu-west1 s3://my-bucket-in-eu-central1 --source-region=eu-west-1 --region=eu-central-1 &` http://www.thegeekstuff.com/2010/12/5-ways-to-execute-linux-command/ – equivalent8 Dec 16 '15 at 10:20
  • @alberge Is there any way to provide the access key & secret using a command-line argument? – EmptyData Jun 21 '17 at 16:37
36

I spent days writing my own custom tool to parallelize the copies required for this, but then I ran across documentation on how to get the AWS S3 CLI sync command to synchronize buckets with massive parallelization. The following commands will tell the AWS CLI to use 1,000 threads to execute jobs (each a small file or one part of a multipart copy) and look ahead 100,000 jobs:

aws configure set default.s3.max_concurrent_requests 1000
aws configure set default.s3.max_queue_size 100000

After running these, you can use the simple sync command as follows:

aws s3 sync s3://source-bucket/source-path s3://destination-bucket/destination-path

On an m4.xlarge machine in AWS (4 cores, 16 GB RAM), for my case (3-50 GB files) the sync/copy speed went from about 9.5 MiB/s to 700+ MiB/s, a speed increase of 70x over the default configuration.

Update: Note that S3CMD has been updated over the years and these changes are now only effective when you're working with lots of small files. Also note that S3CMD on Windows (only on Windows) is seriously limited in overall throughput and can only achieve about 3Gbps per process no matter what instance size or settings you use. Other systems like S5CMD have the same problem. I've spoken to the S3 team about this and they're looking into it.
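For readers asking about the SDK side (see the comments below), here is a minimal sketch, assuming boto3: the transfer manager exposes a comparable concurrency knob, though there is no direct SDK equivalent of max_queue_size, which the CLI manages internally. Bucket names and keys are placeholders.

# Hedged sketch, not part of the original answer: boto3's transfer manager
# exposes a concurrency setting similar to the CLI's max_concurrent_requests.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(
    max_concurrency=100,                   # parallel threads for this transfer
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts for large objects
)

# Managed copy: performs a multipart server-side copy for large objects.
s3.copy(
    {"Bucket": "source-bucket", "Key": "source-path/big-file.bin"},
    "destination-bucket",
    "destination-path/big-file.bin",
    Config=config,
)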

James
  • Thanks, managed to get over 900+MiB/s with your config, huge speed up over default. – kozyr Nov 08 '17 at 22:00
  • @James: Does the API limit us from achieving such high-speed transfers? I am using the TransferManager API provided by the AWS Java SDK vs. the CLI from a T2 EC2 machine to transfer a 2 GB file. The difference in time is ~5.5x (CLI: 14 seconds vs. SDK: 80 seconds). Also, I am not seeing any option for s3.max_queue_size in the SDK. Any comments? – Dwarrior Feb 22 '19 at 21:37
  • @Dwarrior, both of these settings are for the CLI. When using an SDK, you have to manage all the request queueing yourself. AWS support claims to have hit about 80% of the maximum possible throughput between EC2 and S3 using Linux (i.e. the advertised EC2 instance network throughput). Windows is a second-class citizen on AWS and can't get even half that with the Amazon-provided tools, and it looks like they don't plan on fixing that. :-( With a T2 machine, AWS doesn't specify exactly how much bandwidth you get, though things improve somewhat if you set up an S3 VPC endpoint. – James Feb 22 '19 at 23:54
  • @James I went to the extent of parallelizing my list of files over a cluster in Spark, combining that with parallelization within each partition and then using TransferManager for parallel uploads of any given file. I see an improvement from 80 to 45 seconds after doing that, but I'm still missing how the CLI handles it from EC2. Thanks for this setup, though; it drastically improved performance on Windows as well. In the SDK we can set max connections but not the queue size, so I think we might have to live with it. :) Any pointers on how to manage queueing, or any sample code I can take up as a baseline? – Dwarrior Feb 23 '19 at 05:02
  • S5Cmd (https://github.com/peakgames/s5cmd) is the utility the AWS support people used for maximum throughput. Instance size does make a big difference. The new c5n series is very cost-effective for networking and goes all the way up to an amazing 100Gbps. – James Feb 24 '19 at 02:54
  • A note to whoever tries this. I started this with a 5,000 thread pool size and a 100k queue size on a bucket with 1.2 TB and ca. 75 M objects. After 3 days, logging reported a total transferred size of ca. 60 GB at 3.3 MiB/s, which would mean it would take a couple of months for the entire bucket content to be copied. I sent a SIGKILL and a bunch of threads started logging immediately, indicating that the objects actually transferred by then were already around 1/3 of the entire bucket. Thus, if you follow the answer, do not trust the reported transfer rate and size, but monitor the target. – pac Jan 19 '22 at 11:53
30

To move/copy from one bucket to another or within the same bucket, I use the s3cmd tool and it works fine. For instance:

s3cmd cp --recursive s3://bucket1/directory1 s3://bucket2/directory1
s3cmd mv --recursive s3://bucket1/directory1 s3://bucket2/directory1
sgimeno
14

.NET Example as requested:

using (client)
{
    // requestForExistingFile is a ListObjectsRequest (bucket + key prefix) built elsewhere
    var existingObject = client.ListObjects(requestForExistingFile).S3Objects;
    if (existingObject.Count == 1)
    {
        var requestCopyObject = new CopyObjectRequest()
        {
            SourceBucket = BucketNameProd,
            SourceKey = objectToMerge.Key,
            DestinationBucket = BucketNameDev,
            DestinationKey = newKey
        };
        // Server-side copy; the object is never downloaded to the machine running this code
        client.CopyObject(requestCopyObject);
    }
}

with client being something like

var config = new AmazonS3Config { CommunicationProtocol = Protocol.HTTP, ServiceURL = "s3-eu-west-1.amazonaws.com" };
var client = AWSClientFactory.CreateAmazonS3Client(AWSAccessKey, AWSSecretAccessKey, config);

There might be a better way, but it's just some quick code I wrote to get some files transferred.

Matt Dell
  • That seems like a good solution, but what happens if you have different credentials for the 2 buckets? – Roee Gavirel Sep 11 '14 at 09:27
  • The credentials are for the execution of the copy command. Those single credentials require appropriate read/write permissions on the source/target buckets. To copy between accounts, you need to use a bucket policy to allow access to the bucket from the other account's credentials (see the policy sketch after these comments). – Matt Houser Jan 24 '15 at 22:29
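To illustrate the cross-account point from the comment above, here is a minimal sketch, assuming boto3; the account ID, user name and bucket name are placeholders, and the exact actions depend on what your copy job needs to do:

# Hedged sketch only: a bucket policy on the destination (dev) bucket that lets an
# identity from the other account write into it. Placeholders throughout.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111111111111:user/copy-job"},  # other account's identity
            "Action": ["s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::developmentbucket",
                "arn:aws:s3:::developmentbucket/*",
            ],
        }
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket="developmentbucket", Policy=json.dumps(policy)
)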
13

For me the following command just worked:

aws s3 mv s3://bucket/data s3://bucket/old_data --recursive
lony
  • Simple and straightforward solution... why use 3rd-party tools or workarounds for such a simple task when this can be done with the AWS CLI?! – Fr0zenFyr Nov 24 '16 at 06:48
9

If you have a Unix host within AWS, then use s3cmd from s3tools.org. Set up permissions so that your key has read access to your production bucket and write access to your development bucket. Then run:

s3cmd cp -r s3://productionbucket/feed/feedname/date s3://developmentbucket/feed/feedname
dk.
  • Server side? There is no server side for s3. All commands are performed from a remote client. – dk. Jun 27 '13 at 21:55
  • This command seems to work just fine over the internet, by the way! – Gabe Kopley Aug 13 '13 at 22:30
  • The "server side" question is valid. Does the s3cmd transfer shunt all data over to the client, or is it a direct S3 to S3 transfer? If the former, it would be preferable to run this in the AWS cloud to avoid the external WAN transfers. – Bruce Edge Oct 09 '14 at 20:35
  • The copying happens all remotely on S3. – dk. Oct 10 '14 at 01:10
  • Also note that if you accidentally interrupt this process, `s3cmd cp` does not accept the `--skip-existing` option; you can, however, run `s3cmd sync` instead with skip existing. – ianstarz Nov 14 '14 at 02:59
7

Here is a Ruby class for performing this: https://gist.github.com/4080793

Example usage:

$ gem install aws-sdk
$ irb -r ./bucket_sync_service.rb
> from_creds = {aws_access_key_id:"XXX",
                aws_secret_access_key:"YYY",
                bucket:"first-bucket"}
> to_creds = {aws_access_key_id:"ZZZ",
              aws_secret_access_key:"AAA",
              bucket:"first-bucket"}
> syncer = BucketSyncService.new(from_creds, to_creds)
> syncer.debug = true # log each object
> syncer.perform
bantic
7

Recently I just use the copy + paste action in the AWS S3 console: navigate to the files you want to copy, click "Actions" -> "Copy", then navigate to the destination bucket and click "Actions" -> "Paste".

It transfers the files pretty quickly and seems like a less convoluted solution that doesn't require any programming or other over-the-top approaches.

Justin Workman
  • Yes. I discovered the same a few minutes ago. I upvoted, so more people will save time :) – JCarlosR Sep 13 '17 at 17:08
  • I tried that on a bucket to bucket copy with 134,364 objects in it. It took hours. And the destination ended up with only 134,333 files -- the copy said that it was "Successful", but there was no explanation for the missing files. – warrens Oct 19 '19 at 16:31
  • Using the "aws s3 sync" type command described in other posts here, all 134,364 objects were copied in about 20 minutes. – warrens Oct 19 '19 at 17:07
4

We had this exact problem with our ETL jobs at Snowplow, so we extracted our parallel file-copy code (Ruby, built on top of Fog) into its own Ruby gem, called Sluice:

https://github.com/snowplow/sluice

Sluice also handles S3 file delete, move and download, all parallelised and with automatic retry if an operation fails (which it does surprisingly often). I hope it's useful!

Alex Dean
1

I know this is an old thread, but for others who get here, my suggestion is to create a scheduled job to copy content from the production bucket to the development one.

If you use .NET, this article might help you:

https://edunyte.com/2015/03/aws-s3-copy-object-from-one-bucket-or/
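If you are not on .NET, a minimal sketch of such a scheduled job, assuming boto3 and the question's bucket/prefix layout (bucket names, the feed name and the helper are placeholders; run it daily via cron, a scheduled task, or similar):

# Hedged sketch of a daily prefix copy job (not from the linked article).
import datetime
import boto3

def copy_daily_feed(feedname: str) -> None:
    s3 = boto3.client("s3")
    date = datetime.date.today().isoformat()   # e.g. "2015-03-01"
    prefix = f"feed/{feedname}/{date}/"

    # Walk the day's prefix and copy each object server-side
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="productionbucket", Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket="developmentbucket",
                Key=obj["Key"],
                CopySource={"Bucket": "productionbucket", "Key": obj["Key"]},
            )

if __name__ == "__main__":
    copy_daily_feed("feedname")  # scheduled by cron or equivalent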

Nikhil Gaur
0

If you're working in Python you can use cloudpathlib, which wraps boto3 to copy from one bucket to another.

Because it uses the AWS copy operation when going from an S3 source to an S3 target, it doesn't actually download and then re-upload any data; it just asks AWS to copy the file to the new location.

First, be sure to be authenticated properly with an ~/.aws/credentials file or environment variables set with an account that can access both buckets. See more options in the cloudpathlib docs.

Here's how you could copy files from one bucket to another:

from cloudpathlib import CloudPath

source = CloudPath("s3://bucket1/source.txt")
destination = CloudPath("s3://bucket2/destination.txt")

# create the source file
source.write_text("hello!")

# destination does not exist yet
destination.exists()
#> False

# move the source file
source.copy(destination)
#> S3Path('s3://bucket2/destination.txt')

# destination now exists
destination.exists()
#> True

# it has the expected content
destination.read_text()
#> 'hello!'
hume
0

I would like to add something to this question.

I ran into an issue while uploading a 75 GB .gz file.

The error was

An error occurred (InvalidArgument) when calling the UploadPart operation: Part number must be an integer between 1 and 10000, inclusive

I made the changes below:

aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_concurrent_requests 1000
aws configure set default.s3.max_queue_size 100000
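For reference, a back-of-the-envelope check (mine, not from the AWS response): with the default 8 MB chunk size, a 75 GB upload needs roughly 75 * 1024 / 8 ≈ 9,600 parts, right up against the 10,000-part limit, whereas 64 MB chunks need only about 1,200 parts.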

In addition to that, I contacted customer support to ensure it will work fine for future uploads.

Here is the response from AWS customer support.

To mitigate this issue, we have modified the AWS CLI S3 Configuration by executing the below "aws configure set" command to increase the multipart_chunksize value for the default profile:

           $ aws configure set default.s3.multipart_chunksize 64MB

AWS CLI S3 Configuration : https://docs.aws.amazon.com/cli/latest/topic/s3-config.html

Please be informed that the parameter --expected-size (string) is to specify the expected size of a stream in terms of bytes. This argument is needed only when a stream is being uploaded to s3 and the size is larger than 50GB. Failure to include this argument under these conditions may result in a failed upload due to too many parts in upload.

Hence, I would also request you to please provide/pass the parameter --expected-size along with the command that you are executing. For example, please refer the below 'aws s3 cp' command:

           $ aws s3 cp - s3://mybucket/stream.txt --expected-size 54760833024

'aws s3 cp' CLI Reference : https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/cp.html

Ravi Hirani
0

AWS now also allows replication for S3 buckets: https://aws.amazon.com/s3/features/replication/

You can set up filters if you want to copy only specific types of files: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication-add-config.html#replication-config-optional-filter

Additionally, you can use batch replication if it is a one-time activity: https://aws.amazon.com/blogs/aws/new-replicate-existing-objects-with-amazon-s3-batch-replication/

Please note that live replication does not copy existing objects. So you'll need to use a combination of batch replication + live replication to sync your S3 buckets.
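For illustration only, a rough sketch of configuring live replication programmatically, assuming boto3; the role ARN, account ID, bucket names and prefix are placeholders, both buckets must already have versioning enabled, and the role needs the usual replication permissions:

# Hedged sketch: enable live replication from the production bucket to the
# development bucket. All identifiers below are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="productionbucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/s3-replication-role",
        "Rules": [
            {
                "ID": "feed-to-dev",
                "Priority": 1,
                "Filter": {"Prefix": "feed/"},  # optional filter, per the linked docs
                "Status": "Enabled",
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::developmentbucket"},
            }
        ],
    },
)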

amit_saxena
-1

For the new version, aws2:

aws2 s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
Ankit Kumar Rajpoot