How to override s3 data using Glue job in AWS

Question

I have a DynamoDB table, and I am sending its data to S3 using a Glue job. Whenever I run the Glue job to write updated data to S3, the new data is written alongside the old data instead of replacing it. It should overwrite the old data. The job script is below.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "abc", table_name = "xyz", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "abc", table_name = "xyz", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("address", "string", "address", "string"), ("name", "string", "name", "string"), ("company", "string", "company", "string"), ("id", "string", "id", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("address", "string", "address", "string"), ("name", "string", "name", "string"), ("company", "string", "company", "string"), ("id", "string", "id", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://xyztable"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://xyztable"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

I am getting this error: (Parse yarn logs get error message: IllegalArgumentException: 'Can not create a Path from an empty string' Traceback (most recent call last)). — htyagi1, May 26 '20 at 08:21
The S3 bucket path I used: df.write.mode('overwrite').parquet('s3://xyztable') — htyagi1, May 26 '20 at 09:55
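
The error in the comments above is commonly reported when Spark is asked to overwrite the root of a bucket, that is, a path with no key prefix such as s3://xyztable. That diagnosis is an assumption here and is not confirmed anywhere in this thread. A minimal sketch of a pre-flight check follows; has_key_prefix is a hypothetical helper, and the output/ prefix is only illustrative.

```python
from urllib.parse import urlparse


def has_key_prefix(s3_path):
    """Return True if the S3 path has a non-empty key prefix after the bucket."""
    parsed = urlparse(s3_path)
    # parsed.netloc is the bucket name; parsed.path is everything after it.
    return parsed.scheme in ("s3", "s3a", "s3n") and parsed.path.strip("/") != ""


print(has_key_prefix("s3://xyztable"))          # False: bucket root only
print(has_key_prefix("s3://xyztable/output/"))  # True: bucket plus prefix
```

If the check returns False, writing to a folder inside the bucket rather than to the bucket itself may be worth trying.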

score 2 · Accepted Answer · answered May 26 '20 at 03:46

Replace the second-to-last line of your script (the write_dynamic_frame.from_options call) with this:

df = dropnullfields3.toDF()

df.write.mode('overwrite').parquet('s3://xyzPath')

This replaces the folder every time you run the job. The Glue libraries don't support a save mode as of now, so we use the PySpark DataFrame writer here instead.
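
A minimal sketch of the same replacement wrapped in a function, so the sink step is easy to drop into the job script. write_overwrite is a hypothetical helper name; it relies only on DynamicFrame.toDF() and the standard PySpark DataFrameWriter, and the target path is whatever you pass in.

```python
def write_overwrite(dynamic_frame, s3_path):
    """Write a Glue DynamicFrame to s3_path as Parquet, replacing existing data."""
    # DynamicFrame.toDF() converts the Glue frame to a regular Spark DataFrame.
    df = dynamic_frame.toDF()
    # mode("overwrite") replaces whatever already exists at s3_path.
    df.write.mode("overwrite").parquet(s3_path)


# In the job script, this would stand in for the
# write_dynamic_frame.from_options call, for example:
# write_overwrite(dropnullfields3, "s3://xyzPath")
```

The job.commit() line at the end of the script stays where it is.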

score 0 · Answer 2 · answered May 23 '20 at 23:14

If you are trying to overwrite data in S3, DynamicFrame currently does not have a way to change the save mode, but you can convert it to a DataFrame with toDF() and use the methods shared here.

Eman
