I have an AWS Glue job (PySpark) that needs to load data from a centralized data lake (350 GB+), prepare it, and write it to an S3 bucket partitioned by two columns (date and geohash). Mind you, this is PROD data and the environment is PROD.
My Glue job has 60 G.1X workers. Whenever the job runs, it crashes with the error below.
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://<bucket-name>/etl/<directory>/.spark-staging-f163a945-f93c-44b9-bac3-923ec9315275/p_date=2020-11-02/part-00249-f163a945-f93c-44b9-bac3-923ec9315275.c000.snappy.parquet
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 249 in stage 15.0 failed 4 times, most recent failure: Lost task 249.3 in stage 15.0 (TID 16001, 172.36.172.237, executor 21): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have no idea why this is happening; the Spark staging file seems to be the cause. This wasn't an issue in the DEV environment. The data volume is certainly larger in PROD, but the code is the same. My SparkConf looks something like this:
import pyspark

conf = pyspark.SparkConf().setAll([
    ("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"),
    ("spark.speculation", "false"),
    ("spark.sql.parquet.enableVectorizedReader", "false"),
    ("spark.sql.parquet.mergeSchema", "true"),
    ("spark.sql.crossJoin.enabled", "true"),
    ("spark.sql.sources.partitionOverwriteMode", "dynamic"),
    ("spark.hadoop.fs.s3.maxRetries", "20"),
    ("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
])
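For completeness, this conf is wired into the job roughly like this (a simplified sketch of the setup, not the exact PROD script):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

# Resolve job arguments (DataBucket is used in the write further down)
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'DataBucket'])

# Create the Spark context with the conf above, then the Glue context
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session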
Here is my S3 write code:
finalDF.write.partitionBy('p_date').save(
    "s3://{bucket}/{basepath}/{table}/".format(
        bucket=args['DataBucket'], basepath='etl', table='sessions'),
    format='parquet',
    mode='overwrite')
I tried removing the second partition column (geohash) from the write, but I still get the same issue.
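For reference, the write with both partition columns looks roughly like this (assuming the second column is literally named geohash):

finalDF.write.partitionBy('p_date', 'geohash').save(
    "s3://{bucket}/{basepath}/{table}/".format(
        bucket=args['DataBucket'], basepath='etl', table='sessions'),
    format='parquet',
    mode='overwrite')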
Any help would be appreciated.