Delete S3 files from an AWS Data Pipeline

Question

I would like to ask about a processing task I am trying to complete using a data pipeline in AWS, but I have not been able to get it to work.

Basically, I have 2 data nodes representing 2 MySQL databases, from which data is extracted periodically and placed in an S3 bucket. This copy activity is working fine: each day it selects every row that was added during the previous day (today - 1 day).

However, that bucket, which holds the collected data as CSVs, should become the input for an EMR activity that processes those files and aggregates the information. The problem is that I do not know how to remove the already processed files, or move them to a different bucket, so that I do not have to process all the files every day.

To clarify, I am looking for a way to move or remove already processed files in an S3 bucket from a pipeline. Can I do that? Is there any other way I can only process some files in an EMR activity based on a naming convention or something else?

score 6 · Accepted Answer · answered Dec 23 '14 at 19:19

Even better, create a DataPipeline ShellCommandActivity and use the aws command line tools.

Create a script with these two lines:

    sudo yum -y upgrade aws-cli
    aws s3 rm "$1" --recursive

The first line ensures you have the latest aws tools.

The second one removes a directory (an S3 prefix) and all its contents. The $1 is the first argument passed to the script.

In your ShellCommandActivity:

    "scriptUri": "s3://myBucket/scripts/theScriptAbove.sh",
    "scriptArgument": "s3://myBucket/myDirectoryToBeDeleted"

The details on how the aws s3 command works are at:

    http://docs.aws.amazon.com/cli/latest/reference/s3/index.html
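
If the processed files should be kept rather than deleted, which the question also asks about, the same approach should work with a move instead of a remove. Below is a minimal sketch, assuming hypothetical source and archive locations passed as arguments; `archive_processed` is an illustrative name, not part of the aws CLI:

```shell
#!/bin/bash
# Sketch: move processed files to an archive location instead of deleting them.
# Both S3 URIs are hypothetical and are supplied by the caller.
archive_processed() {
    local src="$1"   # e.g. s3://myBucket/incoming/
    local dest="$2"  # e.g. s3://myArchiveBucket/processed/
    if [ -z "$src" ] || [ -z "$dest" ]; then
        echo "usage: archive_processed <source-s3-uri> <destination-s3-uri>" >&2
        return 1
    fi
    # 'aws s3 mv --recursive' copies every object under src to dest,
    # then deletes the source objects.
    aws s3 mv "$src" "$dest" --recursive
}
```

To use it as the pipeline script, the file would end with a call such as `archive_processed "$1" "$2"`, with both locations passed as script arguments.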

score 0 · Answer 2 · answered Oct 25 '14 at 00:13

1) Create a script that takes an input path and deletes the files using hadoop fs -rmr s3path.
2) Upload the script to S3.

In EMR, use a pre-step to:

1) hadoop fs -copyToLocal s3://scriptname .
2) chmod +x scriptname
3) run the script

That's pretty much it.
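
The script from step 1 could look roughly like the sketch below. The function name `delete_processed` is hypothetical, and the sketch uses `hadoop fs -rm -r`, the current spelling of the deprecated `-rmr`:

```shell
#!/bin/bash
# Sketch of the cleanup script; delete_processed is an illustrative name.
delete_processed() {
    local s3path="$1"   # S3 path holding the already processed files
    if [ -z "$s3path" ]; then
        echo "usage: delete_processed <s3-path>" >&2
        return 1
    fi
    # Recursively delete everything under the given path.
    hadoop fs -rm -r "$s3path"
}
```

When saved as a standalone script, the file would end with `delete_processed "$1"` so that the pre-step can pass the path as its argument.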

piggybox · Answer 3 · 2014-10-30T00:15:51.727

Another approach that does not use EMR is to install the s3cmd tool through a ShellCommandActivity on a small EC2 instance. You can then use s3cmd in the pipeline to operate on your S3 bucket in whatever way you want.

The tricky part of this approach is configuring s3cmd safely through a configuration file (basically, passing the access key and secret), because in a pipeline you can't just ssh into the EC2 instance and run 's3cmd --configure' interactively.

To do that, create the config file in the ShellCommandActivity using 'cat' with a here-document. For example:

    cat <<EOT >> s3.cfg
    blah
    blah
    blah
    EOT

Then use the '-c' option to point s3cmd at that config file every time you call it, like this:

    s3cmd -c s3.cfg ls

Sounds complicated, but works.
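
As a rough illustration of what could replace the 'blah' placeholder lines, the sketch below writes a minimal s3cmd config with only the two credential settings. The variable names `S3_ACCESS_KEY` and `S3_SECRET_KEY` and the function name are assumptions; how the credentials reach the instance is left open here.

```shell
#!/bin/bash
# Sketch: write a minimal s3cmd config without running 's3cmd --configure'.
# S3_ACCESS_KEY and S3_SECRET_KEY are hypothetical variable names.
write_s3cmd_config() {
    local cfg="$1"
    # The subshell keeps the stricter umask local to this function.
    # A newly created file is then readable only by the current user.
    ( umask 077
      cat > "$cfg" <<EOT
[default]
access_key = ${S3_ACCESS_KEY}
secret_key = ${S3_SECRET_KEY}
EOT
    )
}
```

After calling `write_s3cmd_config s3.cfg`, the file can be used as above with `s3cmd -c s3.cfg ls`. Note that the umask only affects a newly created file; an existing s3.cfg keeps its current permissions.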
