An AWS Glue job written in PySpark usually works great and creates Parquet files, but occasionally one Parquet file is missing from the output. How can I prevent or mitigate the missing data?
The pertinent line is: FinalDF.write.partitionBy("Year", "Month").mode('append').parquet(TARGET)
In the S3 folder I can see lots of Parquet files following the naming convention part-<sequential number>-<guid>, which makes it obvious that one Parquet file is missing, e.g. part-00001-c7b1b83c-8a28-49a7-bce8-0c31be30ac30.c000.snappy.parquet.
Concretely, part-00001 through part-00032 are present, **except** that part-00013 is missing.
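For what it's worth, this is the kind of sanity check I run to spot the gap. It is a minimal sketch: the filename list is stubbed here, but in practice it would come from an S3 listing (e.g. boto3's list_objects_v2 over the TARGET prefix); the function name find_missing_parts is mine, not from any library.

```python
import re

def find_missing_parts(filenames):
    """Return the part numbers absent from the contiguous range
    min(part)..max(part) among the given Parquet part filenames."""
    part_re = re.compile(r"part-(\d{5})-")
    nums = sorted(int(m.group(1)) for f in filenames
                  if (m := part_re.search(f)))
    if not nums:
        return []
    present = set(nums)
    return [n for n in range(nums[0], nums[-1] + 1) if n not in present]

# Stubbed listing reproducing what I see in S3 (part-00013 absent);
# a real check would list the keys under TARGET instead.
files = [f"part-{i:05d}-c7b1b83c.c000.snappy.parquet"
         for i in range(1, 33) if i != 13]
print(find_missing_parts(files))  # → [13]
```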
I can also see a log entry in CloudWatch which states: WARN [Executor task launch worker for task 587] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(587)): No Output found for attempt_2022 ....
I downloaded the source files and they process fine; I cannot reproduce the issue.
Any ideas on how to avoid this or troubleshoot further? Many thanks.
What I have tried so far: Googled, searched existing posts, and searched the AWS docs with no luck. Tried to reproduce in a dev environment but cannot. Double-checked the backup/DR folder: it has the same data, and the same file is missing there too.
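One thing I have seen suggested but not yet tried (so treat this as an assumption on my part, based on AWS Glue's documented "special job parameters"): switching from the default Hadoop FileOutputCommitter, which the warning above comes from, to the S3-optimized committer, since the classic committer relies on task-commit renames that can misbehave on S3.

```
# Glue job parameter (Glue 1.0/2.0; reportedly the default from Glue 3.0 on):
#   Key:   --enable-s3-parquet-optimized-committer
#   Value: true
```

I would welcome confirmation on whether this committer actually addresses the "No Output found for attempt_..." failure mode, or whether it only helps with partial/duplicate writes.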