How does Partitioning work with AWS Glue Jobs

Question

If I have a Glue job that runs every hour but whose output is partitioned by day, what is the expected behavior? Will the first run create a partition for that day and subsequent runs append to that partition? Is there any documentation that clarifies how this works?

As I understand it, you want to know how Glue creates output partitions for your data, but more context would help me answer accurately. Assuming you are writing to S3, a partition is effectively just another prefix in the bucket. Hence, if your job sets `partitionKeys`, each record is written under a prefix named after its partition key values. As you process newer dates, new prefixes will be created. More here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html — Eman, Nov 19 '20 at 06:03
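A minimal sketch of the write described in the comment above, assuming the job runs inside AWS Glue, where `glue_context` would be a `GlueContext` and `frame` a `DynamicFrame`. The wrapper name `write_partitioned` and the Parquet output format are assumptions for illustration; `write_dynamic_frame.from_options` with a `partitionKeys` connection option is the Glue call the linked documentation describes.

```python
def write_partitioned(glue_context, frame, path, partition_keys):
    """Write `frame` to S3 under Hive-style prefixes, one per partition value.

    `write_partitioned` is a hypothetical helper. Inside a Glue job,
    `glue_context` is a GlueContext and `frame` is a DynamicFrame.
    """
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={
            "path": path,                     # e.g. "s3://my_bucket/logs/"
            "partitionKeys": partition_keys,  # e.g. ["year", "month", "day"]
        },
        format="parquet",  # assumed output format for this sketch
    )
```

With `partition_keys=["year", "month", "day"]`, records land under prefixes such as `s3://my_bucket/logs/year=2018/month=01/day=23/`.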
Thank you. I am asking: if I run a job multiple times throughout the day, and each run produces data with the same partition key values, e.g. `s3://my_bucket/logs/year=2018/month=01/day=23/`, will the job add a new file to that partition each time it runs, or will it append to an existing file in that partition? — sgallagher, Nov 20 '20 at 16:31
S3 objects are immutable, which means an object you have written cannot be modified in place; you can only update its metadata or delete it. Each job run will create one or more new, unique files (depending on the number of output partitions) in the prefix. Glue is based on Spark, so it behaves the way Spark does, except that with dynamic frames there is no option to specify the save mode. — Eman, Nov 21 '20 at 09:20
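The behavior described in these comments can be simulated in plain Python, without AWS. This is only an illustration: the helpers `partition_prefix` and `simulate_run` are hypothetical, a `set` stands in for the bucket's object keys, and the `run-<id>.parquet` file names are made up (real output file names differ).

```python
import uuid


def partition_prefix(base, record, partition_keys):
    """Build the Hive-style prefix that a record's partition-key values map to."""
    parts = [f"{key}={record[key]}" for key in partition_keys]
    return "/".join([base.rstrip("/")] + parts) + "/"


def simulate_run(bucket, base, records, partition_keys):
    """Mimic one job run: add one new, uniquely named object per partition touched.

    `bucket` is a plain set standing in for S3 object keys (an assumption
    for illustration). Existing keys are never modified, only new ones added.
    """
    run_id = uuid.uuid4().hex  # illustrative unique suffix for this run
    prefixes = {partition_prefix(base, r, partition_keys) for r in records}
    for prefix in prefixes:
        bucket.add(f"{prefix}run-{run_id}.parquet")


bucket = set()
keys = ["year", "month", "day"]
records = [{"year": "2018", "month": "01", "day": "23", "msg": "hello"}]

# Two hourly runs that both produce data for the same day.
simulate_run(bucket, "s3://my_bucket/logs", records, keys)
simulate_run(bucket, "s3://my_bucket/logs", records, keys)

for key in sorted(bucket):
    print(key)
```

Both runs write into the same `year=2018/month=01/day=23/` prefix, and the partition ends up holding two separate objects rather than one appended file.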

0 Answers