
I have a JDBC connection to an RDS instance and a crawler set up to populate the Data Catalog.

What's the best practice when setting up scheduled runs so that I avoid duplicates while still keeping each run as efficient as possible? The ETL job writes its output to S3. The data will then be visualized in QuickSight using Athena, or possibly a direct S3 connection; I'm not sure which is preferable. In the ETL job script (PySpark), different tables are joined and new columns are calculated before the final data frame/dynamic frame is stored in S3.
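For context, the job looks roughly like the sketch below (heavily simplified; the database, table and column names are placeholders, and the write step is shown further down under the overwrite option):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled RDS tables from the Data Catalog
orders = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db", table_name="orders").toDF()
customers = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db", table_name="customers").toDF()

# Join the tables and calculate new columns
result = (orders
          .join(customers, on="customer_id", how="left")
          .withColumn("total_incl_vat", F.col("total") * 1.25))

# the final frame is then written to S3 (see the overwrite snippet further down)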

First job run: The data looks something like this (in real life, with a lot more columns and rows): [screenshot: first job run]

Second job run: After some time, when the job is scheduled to run again, the data has changed too (notice the changes marked with the red boxes): [screenshot: second job run]

Upcoming job runs: After some more time has passed, the job runs again on schedule, more changes show up, and so on.

What is the recommended setup for an ETL job like this?

Bookmarks: As far as I understand, bookmarks will produce multiple files in S3, which in turn creates duplicates that would have to be cleaned up with another script.
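If I understand it correctly, the bookmarked variant would look roughly like this (a sketch reusing the glue_context/job setup from the sketch above, with the job bookmark option enabled on the job; all names are placeholders):

from awsglue.dynamicframe import DynamicFrame

# Bookmarked source: with bookmarks enabled, each run only picks up rows the
# job has not processed before; the bookmark state is keyed on transformation_ctx
orders_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db",
    table_name="orders",
    transformation_ctx="orders_src")

# ... joins and derived columns as in the sketch above, producing `result` ...
result_dyf = DynamicFrame.fromDF(result, glue_context, "result_dyf")

# Bookmarked sink: each run appends new Parquet files under the prefix,
# which is where my duplicate concern comes from
glue_context.write_dynamic_frame.from_options(
    frame=result_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-target-bucket/output/"},
    format="parquet",
    transformation_ctx="s3_sink")

job.commit()  # commits the bookmark state for the next scheduled run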

Overwrite: Using the 'overwrite' option for the data frame

df.repartition(1).write.mode('overwrite').parquet("s3a://target/name")
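In context, that write sits at the end of the job sketched above, roughly like this (bucket and prefix are placeholders):

# End of the job from the first sketch: collapse the result into a single
# Parquet file and replace whatever the previous run wrote under the prefix
(result
    .repartition(1)             # one output file per run
    .write
    .mode("overwrite")          # full replace, so no duplicates across runs
    .parquet("s3a://my-target-bucket/output/"))

job.commit()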

Today I've been using the overwrite method, but it has given me some issues: at one point I needed to change the ETL job script, and the update changed the data stored in S3 so much that my QuickSight dashboards crashed and could not simply be pointed at the new data set (built on the new data frame stored in S3), which meant I had to rebuild the dashboards from scratch.

Please give me your best tips and tricks for smoothly performing ETL jobs on randomly updating tables in AWS Glue!

Qwaz