
I'm using Amazon's Data Pipeline to copy an S3 bucket to another bucket. It's a pretty straightforward setup and runs nightly. However, every subsequent run copies the same files over and over; I'd rather it skip existing files and copy only the new ones, as this backup is going to get quite large in the future. Is there a way to do this?


2 Answers


Looking at this thread, it seems it is not possible to do the sync with the default CopyActivity:

You can definitely use Data Pipeline to copy one S3 directory to another, with the caveat that, if you use the CopyActivity, it'll be a full copy, not an rsync. So if you're operating on a large number of files where only a small fraction have changed, the CopyActivity wouldn't be the most efficient way to do it.

You could also write your own logic to perform the diff and then only sync that, and use the CommandRunnerActivity to schedule and manage it.

I think they actually refer to the ShellCommandActivity, which allows you to schedule a shell command to run.

I can't give you an exact configuration example, but here is an example of a command you could run as a regular cron job to sync two buckets: `aws s3 sync s3://source_bucket s3://target_bucket`.
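If you went the plain cron route, a minimal sketch of a nightly crontab entry might look like this (the 3 AM time is arbitrary, and it assumes the AWS CLI and credentials are already set up on the host):

0 3 * * * aws s3 sync s3://source_bucket s3://target_bucket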

It should be possible to run it with ShellCommandActivity. See also ShellCommandActivity in AWS Data Pipeline, and the comments on the answer here.
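For illustration, a rough sketch of such an activity in a pipeline definition might look like the following (the bucket names and the Ec2Instance reference are placeholders, and you'd still need a schedule or on-demand configuration for the nightly run):

{
  "id": "SyncActivity",
  "name": "SyncActivity",
  "type": "ShellCommandActivity",
  "runsOn": {
    "ref": "Ec2Instance"
  },
  "command": "aws s3 sync s3://source_bucket s3://target_bucket"
}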

Update: the comment by @trevorhinesley below has the final solution (the default instance launched by the pipeline uses an old version of the AWS CLI that lacks the sync command):

For anyone who comes across this, I had to fire up an EC2 instance, then copy the AMI ID that it used (it's in the info below the list of instances when you select it in the Instances menu under EC2). I used that image ID in the data pipeline and it fixed it!
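In pipeline terms, that means pointing the Ec2Resource the activity runs on at a custom AMI via the imageId field. A rough sketch (the AMI ID and instance settings here are placeholders; use the image ID you copied from the EC2 console):

{
  "id": "Ec2Instance",
  "name": "Ec2Instance",
  "type": "Ec2Resource",
  "imageId": "ami-0123456789abcdef0",
  "instanceType": "t1.micro",
  "terminateAfter": "30 Minutes"
}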

  • Strangely enough, the EC2 instance that is spun up has an outdated CLI, so the sync command isn't working. How do I ensure it has an updated CLI? – trevorhinesley Jan 29 '17 at 21:24
  • @trevorhinesley that is strange, as the aws cli comes pre-installed on machines with Amazon Linux and the version is usually recent. I would look in two directions: 1) find which Amazon Linux version is running on the instance; maybe there is a way to specify the version or use a custom AMI; 2) try to update the aws cli with something like `sudo pip install -U awscli && aws s3 sync ...` – Borys Serebrov Jan 29 '17 at 22:30
  • For anyone who comes across this, I had to fire up an EC2 instance, then copy the AMI ID that it used (it's in the info below the list of instances when you select it in the Instances menu under EC2). I used that image ID in the data pipeline and it fixed it! – trevorhinesley Jan 30 '17 at 04:16
  • @trevorhinesley thanks for the update and, by the way, thanks for the question too. I've been using cron jobs for such tasks, and now I'll consider switching to data pipelines since it fits better into the AWS concept: I won't need to set up and manage those cron jobs. – Borys Serebrov Jan 30 '17 at 09:44
  • You bet! Thanks for the help. I like data pipelines overall. – trevorhinesley Feb 01 '17 at 03:16

You could do the following to ensure an updated AWS CLI. The first snippet is the activity that runs; the second is the parameter, with the value used in the run.

{
  "name": "CliActivity",
  "id": "CliActivity",
  "runsOn": {
    "ref": "Ec2Instance"
  },
  "type": "ShellCommandActivity",
  "command": "(sudo yum -y update aws-cli) && (#{myAWSCLICmd})"
}

"parameters": [
  {
    "watermark": "aws [options] <command> <subcommand> [parameters]",
    "description": "AWS CLI command",
    "id": "myAWSCLICmd",
    "type": "String"
  }
],
"values": {
  "myAWSCLICmd": "aws s3 sync s3://source s3://target"
}