I have a simple glue job where I am using pyspark to read 14million rows from RDS using JDBC and then trying to save it into S3. I can see Output logs in Glue that reading and creating dataframe is quick but while calling write opeation, it fails with the error:
error occurred while calling o89.save. Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, 10.150.85.95, executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I have tried the following solutions:
- Adding --conf with spark.executor.memory=10g and also with 30g after seeing some solutions on SO.
- Tried to convert spark df to DynamicFrame and then call the save opeartion.
- Tried increasing the workers to 500! And still no luck getting it to pass.
1 weird thing I observed it is, after I create the dataframe by reading from JDBC, it keeps the entire df in 1 partition until I do repartition. But the reading step completes without any error.
I used the same code to run for 6M rows and the job completes in 5 mins. But it fails for 14M rows with the ExecutorLostFailure Error. I also see this error sometimes if I dig deep in the Logs: 2023-01-22 10:36:52,972 WARN [allocator] glue.ExecutorTaskManagement (Logging.scala:logWarning(66)): executor task creation failed for executor 203, restarting within 15 secs. restart reason: Executor task resource limit has been temporarily hit..
Code:
def read_from_db():
logger.info(f'Starts Reading Data from {DB_TABLE} table')
start = time.perf_counter()
filter_query = f'SELECT * FROM {DB_TABLE}'
sql_query = '({}) as query'.format(filter_query)
spark_df = (glueContext.read.format('jdbc')
.option('driver', 'org.postgresql.Driver')
.option('url', JDBC_URL)
.option('dbtable', sql_query)
.option('user', DB_USERS)
.option('password', DB_PASSWORD)
.load()
)
end = time.perf_counter()
logger.info(f'Count of records in DB is {spark_df.count()}')
logger.info(f'Elapsed time for reading records from {DB_TABLE} table = {end - start:0.4f} seconds')
logger.info(f'Finished Reading Data from {DB_TABLE} table')
logger.info(f"Total no. of partitions - {spark_df.rdd.getNumPartitions()}")
# def write_to_s3(spark_df_rep):
# S3_PATH = (
# f"{S3_BUCKET}/all-entities-update/{date}/{cur_time}"
# )
# spark_df_rep.write.format("csv").option("header", "true").save(S3_PATH)
spark_df = spark_df.repartition(20)
logger.info(f"Completed Repartitioning. Total no. of partitions - {spark_df.rdd.getNumPartitions()}")
# spark_df.foreachPartition(write_to_s3)
# spark_dynamic_frame = DynamicFrame.fromDF(spark_df, glueContext, "spark_dynamic_frame")
# logger.info("Conversion to DynmaicFrame compelete")
# glueContext.write_dynamic_frame.from_options(
# frame=spark_dynamic_frame,
# connection_type="s3",
# connection_options={"path": S3_PATH},
# format="csv"
# )
S3_PATH = (
f"{S3_BUCKET}/all-entities-update/{date}/{cur_time}"
)
spark_df.write.format("csv").option("header", "true").save(S3_PATH)
return