
I am currently designing an AWS-backed data lake.

What I have right now:

  1. XML files are uploaded to S3.
  2. An AWS Glue crawler builds the catalog.
  3. An AWS Glue ETL job transforms the data and saves it in Parquet format (a minimal sketch of such a job follows this list).
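For reference, a minimal Glue job along these lines might look as follows. This is only a sketch, not the contents of the gist linked below; the database name ("mydb"), table name ("xml_table"), and output bucket ("output-bucket") are placeholders.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the table that the crawler created in the Glue Data Catalog
# (database and table names are placeholders)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="xml_table")

# Write the data back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://output-bucket/parquet/"},
    format="parquet")

job.commit()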

Each time the ETL job transforms the data, it creates new Parquet files. I assume that the most efficient way to store my data would be a single Parquet file. Is that the case? If so, how can I achieve this?

Auto generated job code: https://gist.github.com/jkornata/b36c3fa18ae04820c7461adb52dcc1a1

SirKometa

1 Answer


You can do that with Spark's 'overwrite' save mode. Glue itself doesn't support 'overwrite', but you can convert the DynamicFrame to a Spark DataFrame and write it with Spark instead of Glue:

(dropnullfields3.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .save("s3://output-bucket/[nameOfyourFile].parquet"))
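Note that "overwrite" only replaces the previous run's output; Spark still writes one Parquet part file per partition, so this alone does not give you a single file. If you really need a single Parquet file, one option (a sketch, assuming the same dropnullfields3 frame and a placeholder bucket) is to repartition to one partition before writing. Be aware that this funnels all data through a single executor, so it is only sensible for modest data volumes:

# Collapse to one partition so Spark writes a single part-*.parquet file
# under the target prefix ("output-bucket" is a placeholder).
(dropnullfields3.toDF()
    .repartition(1)
    .write
    .mode("overwrite")
    .parquet("s3://output-bucket/single-file-output/"))

Using coalesce(1) instead of repartition(1) avoids a full shuffle and is usually enough here.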