I am currently in the process of designing an AWS-backed data lake.
What I have right now:
- XML files uploaded to S3
- An AWS Glue crawler builds the Data Catalog
- An AWS Glue ETL job transforms the data and saves it in Parquet format.
Each time the ETL job transforms the data, it creates new Parquet files. I assume that the most efficient way to store my data would be a single Parquet file. Is that the case? If so, how can I achieve this?
Auto-generated job code: https://gist.github.com/jkornata/b36c3fa18ae04820c7461adb52dcc1a1
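
For reference, this is a minimal sketch of what I imagine the single-file write could look like, assuming a standard Glue PySpark job and converting to a Spark DataFrame so I can call `coalesce(1)`. The database name (`my_db`), table name (`xml_table`) and output path are placeholders, not the real ones from my job:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created from the XML files (placeholder names).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",        # placeholder
    table_name="xml_table",  # placeholder
)

# Collapse everything into one Spark partition, then convert back to a
# DynamicFrame so the write step produces a single Parquet part file.
single_partition = DynamicFrame.fromDF(
    dyf.toDF().coalesce(1), glueContext, "single_partition"
)

glueContext.write_dynamic_frame.from_options(
    frame=single_partition,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},  # placeholder path
    format="parquet",
)

job.commit()
```

I am not sure whether funnelling everything through one partition like this would become a bottleneck once the data grows, which is partly why I am asking whether a single Parquet file is the right goal in the first place.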