32

I have an S3 bucket with around 4 million files taking up some 500 GB in total. I need to sync the files to a new bucket (actually, renaming the bucket would suffice, but since that is not possible I need to create a new bucket, move the files there, and remove the old one).

I'm using the AWS CLI's s3 sync command and it does the job, but it takes a lot of time. I would like to reduce that time so that the downtime of the dependent system is minimal.

I have tried running the sync both from my local machine and from an EC2 c4.xlarge instance, and there isn't much difference in the time taken.

I have noticed that the time can be somewhat reduced when I split the job into multiple batches using the --exclude and --include options and run them in parallel from separate terminal windows, e.g.

aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"

Is there anything else I can do to speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea, and is there something like an 'optimal' number of sync processes that can run in parallel on the same bucket?

Update

I'm leaning towards the strategy of syncing the buckets before taking the system down, doing the migration, and then syncing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command takes a lot of time even on buckets with no differences.
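In sketch form (bucket names are placeholders), that two-pass strategy is just the same command run twice, the second time inside the maintenance window:

# Pass 1, while the system is still live: the bulk copy, which may take hours.
aws s3 sync s3://source-bucket s3://destination-bucket

# Pass 2, during the downtime window: copies only objects that changed since
# pass 1; --delete also removes objects that were deleted from the source.
aws s3 sync s3://source-bucket s3://destination-bucket --delete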

mrt
  • 500gig of data is going to take a long while to copy, no matter what you do. Disks only have so much bandwidth available. – Marc B Aug 25 '16 at 15:27
  • @MarcB true. Forgot to mention that the migration strategy I lean towards is to sync the buckets before taking the system down, do the switch, and then run sync again to copy only the minimal number of files that changed in the meantime. It looks like the `sync` command takes a lot of time even just to check if the files changed - even if no copying of files is actually required. – mrt Aug 25 '16 at 15:31
  • Just how many files are there in this 500 GB? Even just comparing timestamps would be slow, since it basically calls for a `stat()` operation on every single file. No idea what syncing actually does in the background, but if the backend systems compare physical bytes (in case timestamps didn't change), or hash the files and compare hashes, you're STILL reading 2x500 GB of data to get those bytes/hashes. – Marc B Aug 25 '16 at 15:32
  • Did you try enabling the accelerated transfer on the buckets? – Piyush Patil Aug 25 '16 at 15:49
  • @error2007s have a look at my update. The operation takes long even if no file transfer is made. – mrt Aug 25 '16 at 16:04
  • Yes, I checked that. Did you try enabling accelerated transfer and then running the sync? – Piyush Patil Aug 25 '16 at 16:15
  • @error2007s quickly looking at this page http://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html it seems to serve a different purpose - client to S3 transfer. My scenario involves an internal sync between buckets (both located in the same AWS region) – mrt Aug 25 '16 at 16:41
  • It is not just for client-to-S3; it can be used in your case too. Enable S3 accelerated transfer on both buckets, then try the sync command and see if it speeds up the process. – Piyush Patil Aug 25 '16 at 16:47
  • @error2007s I appreciate your suggestion, but what are you basing it on? – mrt Aug 25 '16 at 16:52
  • How many objects do you have in the bucket? The CLI needs to list all of them 1000 at a time, meaning there is a request for every 1000 objects in each bucket. Then it needs to compare those. You can speed up transfer by making sure your instance is in the same region as your buckets and that you use an instance with high bandwidth (see the listing sketch just after these comments). – Jordon Phillips Aug 25 '16 at 17:46
  • To give an example, I'm syncing 400k objects which are each 4kb in size. All files are being synced, both buckets and the instance (m3.xlarge) are in the same region. This takes around 51 minutes, and that's only around 1.5GB of data. Sync will spin up multiple threads, so running additional syncs likely won't yield any benefit. `c4.xlarge` is probably the best instance for the task. Using transfer acceleration will help a lot if your buckets are in different regions. – Jordon Phillips Aug 25 '16 at 17:58
  • @JordonPhillips `aws s3 sync` bucket to bucket should be using the put+copy API, which allows objects to be copied from bucket to bucket using S3's internal private network -- whether in the same region or different -- not the same as download+upload. – Michael - sqlbot Aug 25 '16 at 18:55
  • That's true, but there's still the transfer rate of your `ListObjects` requests to take into account. The response for 1000 objects can be around half a megabyte. That's 200MB for my example above. So for transfers of large numbers of files, you'll still save good time by being in the same region. – Jordon Phillips Aug 25 '16 at 19:29
  • @mrt Since you've already been able to split up the sync with filters you could simply run each in a separate instance and then tear down the extras after. – Jordon Phillips Aug 25 '16 at 19:34
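As a rough illustration of the listing overhead mentioned in the comments (bucket name is a placeholder): --summarize makes aws s3 ls print total object and byte counts at the end of the listing, and dividing the object count by 1,000 gives the number of ListObjects requests each sync pass must issue per bucket. Note that this command itself pages through the whole bucket, so it takes about as long as the listing phase of a sync.

aws s3 ls s3://source-bucket --recursive --summarize | tail -n 2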

8 Answers

21

You can use EMR and S3DistCp. I had to sync 153 TB between two buckets, and it took about 9 days. Also make sure the buckets are in the same region, because otherwise you get hit with data transfer costs.

aws emr add-steps --cluster-id <value> --steps '[{"Name":"S3DistCp","Type":"CUSTOM_JAR","Jar":"command-runner.jar","ActionOnFailure":"CONTINUE","Args":["s3-dist-cp","--s3Endpoint","s3.amazonaws.com","--src","s3://SOURCE-BUCKET","--dest","s3://DESTINATION-BUCKET"]}]'

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html

strongjz
15

40,100 objects (160 GB) were copied/synced in less than 90 seconds.

Follow the steps below:

Step 1: Select the source bucket.
Step 2: Under the bucket's properties, choose the advanced settings.
Step 3: Enable transfer acceleration and note the endpoint.

AWS CLI configuration (one time only; no need to repeat this for every sync):

aws configure set default.region us-east-1 #set it to your default region
aws configure set default.s3.max_concurrent_requests 2000
aws configure set default.s3.use_accelerate_endpoint true


Options:

--delete: deletes files in the destination if they are not present in the source.

AWS CLI command to sync:

aws s3 sync s3://source-test-1992/foldertobesynced/ s3://destination-test-1992/foldertobesynced/ --delete --endpoint-url http://source-test-1992.s3-accelerate.amazonaws.com

Transfer acceleration pricing:

https://aws.amazon.com/s3/pricing/#S3_Transfer_Acceleration_pricing

They have not mentioned the pricing if the buckets are in the same region.

Pruthvi Raj
  • Note transfer acceleration won't work if a bucket has a `.` in its name. The `aws` command line will also give a confusing ": Bucket named is not DNS compatible" error if it's in the config (https://stackoverflow.com/a/41795555/18706). – mahemoff Nov 21 '19 at 10:45
  • Transfer acceleration is a feature that allows your client to connect to the closest AWS Region rather than the region of the S3 bucket. I fail to see how it can make any difference for syncing 2 buckets, as the transfer is already inside the AWS network. 40,000 objects is not a lot; it's only 40 loops of list -> copy. – JCMS Jul 26 '22 at 13:28
9

As a variant of what the OP is already doing: one could create a list of all files to be synced with aws s3 sync --dryrun

aws s3 sync s3://source-bucket s3://destination-bucket --dryrun
# or even
aws s3 ls s3://source-bucket --recursive

Using the list of objects to be synced, split the job into multiple aws s3 cp ... commands. This way, the AWS CLI won't just hang there while getting a list of sync candidates, as it does when one starts multiple sync jobs with --exclude "*" --include "1?/*" style arguments.
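A rough sketch of that splitting, assuming placeholder bucket names, GNU split/xargs, and keys without whitespace (each cp here is a server-side bucket-to-bucket copy, so no object data passes through the client):

SRC=s3://source-bucket
DST=s3://destination-bucket

# 1. Build the full key list once (column 4 of `aws s3 ls` output is the key).
aws s3 ls "$SRC" --recursive | awk '{print $4}' > keys.txt

# 2. Split it into chunks; each chunk can be fed to its own worker
#    (another terminal window, or another instance, as suggested in the comments).
split -n l/8 keys.txt chunk_

# 3. Copy one chunk with many objects in flight at a time.
xargs -P 32 -I {} aws s3 cp "$SRC/{}" "$DST/{}" --only-show-errors < chunk_aa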

When all "copy" jobs are done, another sync might be worth it, for good measure, perhaps with --delete, if object might get deleted from "source" bucket.

In case of "source" and "destination" buckets located in different regions, one could enable cross-region bucket replication, before starting to sync the buckets..

anapsix
  • I'd be curious to know the speed of this command when executed with a large dataset: `SOURCE_BUCKET="s3://source-bucket"; TARGET_BUCKET="s3://destination-bucket"; aws s3 ls ${SOURCE_BUCKET} --recursive | awk '{print $4}' | xargs -P 64 -I % aws s3 cp ${SOURCE_BUCKET}/% ${TARGET_BUCKET}/%` – Hames Nov 25 '20 at 01:38
7

New option in 2020:

We had to move about 500 terabytes (10 million files) of client data between S3 buckets. Since we only had a month to finish the whole project, and aws s3 sync tops out at about 120 MB/s, we knew right away this was going to be trouble.

I found this Stack Overflow thread first, but when I tried most of the options here, they just weren't fast enough. The main problem is that they all rely on serial item listing. To solve this, I figured out a way to parallelize listing any bucket without any a priori knowledge. Yes, it can be done!

The open source tool is called S3P.

With S3P we were able to sustain copy speeds of 8 gigabytes/second and listing speeds of 20,000 items/second using a single EC2 instance. (It's a bit faster to run S3P on EC2 in the same region as the buckets, but S3P is almost as fast running on a local machine.)


Or just try it out:

# Run in any shell to get command-line help. No installation needed:

npx s3p

(requires Node.js, the AWS CLI, and valid AWS CLI credentials)

  • In my scenario I want to sync only those files which have been modified in the last hour. Is there a way that s3p can help me achieve that? – Shivam Singh Mar 18 '21 at 12:59
  • Could you please share some examples of how to copy based on the last-modified date? – Shivam Singh Mar 18 '21 at 17:31
  • @ShivamSingh in your scenario, it may be better to use [S3 same region replication or cross region replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html) depending on your use case. – Alex Jul 28 '21 at 16:39
4

Background: The bottlenecks in the sync command are listing objects and copying objects. Listing objects is normally a serial operation, although if you specify a prefix you can list a subset of objects; this is the only trick to parallelizing it. Copying objects can be done in parallel.

Unfortunately, aws s3 sync doesn't do any parallelizing, and it doesn't even support listing by prefix unless the prefix ends in / (i.e., it can list by folder). This is why it's so slow.

s3s3mirror (and many similar tools) parallelizes the copying. I don't think it (or any other tool) parallelizes listing objects, because that requires a priori knowledge of how the objects are named. However, it does support prefixes, and you can invoke it multiple times, once for each letter of the alphabet (or whatever is appropriate).

You can also roll your own using the AWS API.
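As a hedged illustration of the roll-your-own route (bucket names, prefix, and concurrency are placeholders; objects over 5 GB would need a multipart copy instead), the low-level s3api commands can list a single prefix and issue server-side CopyObject calls in parallel; running one invocation per prefix parallelizes the listing as well:

SRC_BUCKET=source-bucket
DST_BUCKET=destination-bucket
PREFIX="1"   # one process per prefix, as described above

# List the keys under the prefix, then fan out server-side copy-object calls.
aws s3api list-objects-v2 --bucket "$SRC_BUCKET" --prefix "$PREFIX" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' |
  xargs -P 32 -I {} aws s3api copy-object \
      --copy-source "$SRC_BUCKET/{}" \
      --bucket "$DST_BUCKET" --key "{}" > /dev/null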

Lastly, the aws s3 sync command itself (and any tool for that matter) should be a bit faster if you launch it in an instance in the same region as your S3 bucket.

Aleksandr Dubinsky
2

As explained in a May 2020 AWS blog post, one can also use S3 replication for existing objects. This requires contacting AWS Support to enable the feature:

Customers can copy existing objects to another bucket in the same or different AWS Region by contacting AWS Support to add this functionality to the source bucket.
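For context, a minimal sketch of what enabling replication for new objects looks like with the CLI (bucket names, account ID, and the IAM role are placeholders; both buckets must have versioning enabled, and existing objects still need the Support-enabled feature, or S3 Batch Replication, as this answer and the comment below note):

# Versioning is a prerequisite for replication on both buckets.
aws s3api put-bucket-versioning --bucket source-bucket \
    --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket destination-bucket \
    --versioning-configuration Status=Enabled

# Minimal rule: replicate everything to the destination bucket. The role must
# allow S3 to read from the source and replicate into the destination.
aws s3api put-bucket-replication --bucket source-bucket \
    --replication-configuration '{
      "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
      "Rules": [{
        "ID": "replicate-all",
        "Prefix": "",
        "Status": "Enabled",
        "Destination": { "Bucket": "arn:aws:s3:::destination-bucket" }
      }]
    }'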

Marcin
  • In 2022 it is possible to create replication between buckets (in the same or another region) plus a batch job to run the replication for already-created objects. Link: https://aws.amazon.com/blogs/storage/replicating-existing-objects-between-s3-buckets/ – Andron Sep 03 '22 at 01:35
0

I used AWS DataSync to migrate 95 TB of data. It took about 2 days. It has all these fancy things for network optimization and parallelization of the jobs. You can even have checks on the source and destination to be sure everything transferred as expected.

https://aws.amazon.com/datasync/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
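For the curious, a rough sketch of the DataSync setup with the CLI (role ARNs, bucket names, and the task name are placeholders; the bucket-access role must grant DataSync read/write on the respective buckets):

# One DataSync location per bucket.
SRC_LOC=$(aws datasync create-location-s3 \
    --s3-bucket-arn arn:aws:s3:::source-bucket \
    --s3-config BucketAccessRoleArn=arn:aws:iam::123456789012:role/datasync-s3-role \
    --query LocationArn --output text)
DST_LOC=$(aws datasync create-location-s3 \
    --s3-bucket-arn arn:aws:s3:::destination-bucket \
    --s3-config BucketAccessRoleArn=arn:aws:iam::123456789012:role/datasync-s3-role \
    --query LocationArn --output text)

# Create the task and start an execution; DataSync handles the parallelism
# and the source/destination verification mentioned above.
TASK_ARN=$(aws datasync create-task \
    --source-location-arn "$SRC_LOC" \
    --destination-location-arn "$DST_LOC" \
    --name s3-bucket-migration \
    --query TaskArn --output text)
aws datasync start-task-execution --task-arn "$TASK_ARN"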

0

I'm one of the developers of Skyplane, which can copy data across buckets at over 110x the speed of the cloud CLI tools. You can sync two buckets with:

skyplane sync -r s3://bucket-1/ s3://bucket-2/

Under the hood, Skyplane creates ephemeral VM instances which parallelize syncing the data across multiple machines (so you're not bottlenecked by disk bandwidth).

swooders