9

I need to copy a zipped file from one AWS S3 folder to another and would like to make that a scheduled AWS Glue job. I cannot find an example for such a simple task. Please help if you know the answer. Maybe the answer is in AWS Lambda, or another AWS tool.

Thank you very much!

Jie

8 Answers

13

You can do this, and there may be a reason to use AWS Glue: if you have chained Glue jobs and glue_job_#2 is triggered on the successful completion of glue_job_#1.

The simple Python script below moves a file from one S3 folder (source) to another folder (target) using the boto3 library, and optionally deletes the original copy in the source directory.

import boto3

bucketname = "my-unique-bucket-name"
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucketname)
source = "path/to/folder1"
target = "path/to/folder2"

for obj in my_bucket.objects.filter(Prefix=source):
    # Skip the zero-byte "folder" placeholder object, if one exists
    if obj.key.endswith('/'):
        continue
    source_filename = obj.key.split('/')[-1]
    copy_source = {
        'Bucket': bucketname,
        'Key': obj.key
    }
    target_filename = "{}/{}".format(target, source_filename)
    # Server-side copy: the object's bytes never leave S3
    s3.meta.client.copy(copy_source, bucketname, target_filename)
    # Uncomment the line below if you wish to delete the original source file
    # s3.Object(bucketname, obj.key).delete()

Reference: Boto3 Docs on S3 Client Copy

Note: I would use f-strings for generating the target_filename, but f-strings are only supported in Python >= 3.6, and I believe the default AWS Glue Python interpreter is still 2.7.

Reference: PEP on f-strings
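
For reference, on Python >= 3.6 that assignment would be the one-liner below; it is purely cosmetic and behaves identically to the .format() version above:

target_filename = f"{target}/{source_filename}"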

tatlar
  • Is it possible to modify this code to copy to the Glue tmp folder that is accessible from within a job? https://stackoverflow.com/questions/66376252/error-when-trying-to-copy-to-aws-glue-tmp-folder-in-python-shell – Ravmcgav Feb 26 '21 at 00:49
4

I think you can do it with Glue, but wouldn't it be easier to use the CLI?

You can do the following:

aws s3 sync s3://bucket_1 s3://bucket_2
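
Since the question is about a single zipped file rather than a whole bucket, a more targeted sketch (bucket names are placeholders) filters the sync down to zip files using sync's --exclude/--include options:

aws s3 sync s3://bucket_1 s3://bucket_2 --exclude "*" --include "*.zip"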

Dave Whittingham
  • 1
    The reason for using Glue is that it can be a job: on the job's completion, other jobs can be triggered. – Jie Dec 06 '17 at 16:46
  • I am not 100% sure Glue is the tool for that. Glue is more of an ETL tool that crawls databases to extract data into AWS. Have you had a look at Data Pipeline? http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html – Dave Whittingham Dec 09 '17 at 14:17
  • 2
    An AWS rep told me recently that Data Pipeline was likely to be phased out in favour of Glue ETL over time. Not sure how official that is but I would probably go with Glue ETL if I had to choose between them, seems more likely AWS will be investing in that long term. – Nathan Griffiths Dec 13 '17 at 02:05
2

You could do this with Glue, but it's not the right tool for the job.

Far simpler would be to have a Lambda function triggered by an S3 created-object event. There's even a tutorial in the AWS docs on doing (almost) this exact thing.

http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
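
For illustration, here is a minimal sketch of such a handler (the TARGET_BUCKET name is a hypothetical placeholder; the function is assumed to be subscribed to the source bucket's ObjectCreated events):

import boto3
import urllib.parse

s3 = boto3.client('s3')

TARGET_BUCKET = 'my-target-bucket'  # hypothetical destination bucket

def lambda_handler(event, context):
    # An S3 created-object event carries the source bucket and the (URL-encoded) key
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Server-side copy: the object's bytes never pass through the function
        s3.copy_object(
            Bucket=TARGET_BUCKET,
            Key=key,
            CopySource={'Bucket': source_bucket, 'Key': key}
        )

Because the copy happens server-side within S3, Lambda's memory limit is not a concern (though copy_object itself tops out at 5 GB per object; larger objects need a multipart copy).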

Kirk Broadhurst
1

We ended up using Databricks to do everything.

Glue is not ready. It returns error messages that make no sense. We created tickets and waited for five days, and still got no reply.

Jie
1

The S3 API lets you issue a COPY request (really a PUT with a header indicating the source URL) to copy objects within or between buckets. It's regularly used to fake rename()s, but you can initiate the call yourself, from anything.

There is no need to download any data; within the same S3 region the copy runs at about 6-10 MB/s.

The AWS CLI cp command can do this.
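
For example, a single server-side copy from the CLI (bucket and key names are hypothetical):

aws s3 cp s3://my-bucket/path/to/folder1/archive.zip s3://my-bucket/path/to/folder2/archive.zip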

stevel
0

You can do this by downloading your zip file from S3 to the tmp/ directory and then re-uploading it to S3.

import boto3

s3 = boto3.resource('s3')

Download the file to the local Spark directory tmp (bucket_name, DATA_DIR, TARGET_DIR and file are assumed to be defined elsewhere):

s3.Bucket(bucket_name).download_file(DATA_DIR + file, 'tmp/' + file)

Upload the file from the local Spark directory tmp:

s3.meta.client.upload_file('tmp/' + file, bucket_name, TARGET_DIR + file)
Kishore
0

Now you can write a Python shell job in Glue to do it. Just set the Type in the Glue job creation wizard to Python Shell. You can run a normal Python script in it.

Sandeep Fatangare
0

Nothing extra is required. I believe AWS Data Pipeline is the best option: just use the command-line option. Scheduled runs are also possible. I already tried it, and it worked successfully.