I am currently in the process of designing an AWS-backed data lake.
What I have right now:
- XML files uploaded to S3
- An AWS Glue crawler builds the Data Catalog
- An AWS Glue ETL job transforms the data and saves it in Parquet format.
Each time the ETL job transforms the data, it creates new Parquet files. I assume that the most efficient way to store my data would be a single Parquet file. Is that the case? If so, how can I achieve this?
Auto-generated job code: https://gist.github.com/jkornata/b36c3fa18ae04820c7461adb52dcc1a1
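
For reference, this is a minimal sketch of what I imagine the single-file write could look like, assuming a standard Glue PySpark job and converting to a Spark DataFrame so I can call `coalesce(1)`. The database name (`my_db`), table name (`xml_table`) and output path are placeholders, not the real ones from my job:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created from the XML files (placeholder names).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",        # placeholder
    table_name="xml_table",  # placeholder
)

# Collapse everything into one Spark partition, then convert back to a
# DynamicFrame so the write step produces a single Parquet part file.
single_partition = DynamicFrame.fromDF(
    dyf.toDF().coalesce(1), glueContext, "single_partition"
)

glueContext.write_dynamic_frame.from_options(
    frame=single_partition,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},  # placeholder path
    format="parquet",
)

job.commit()
```

I am not sure whether funnelling everything through one partition like this would become a bottleneck once the data grows, which is partly why I am asking whether a single Parquet file is the right goal in the first place.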