
I have an AWS Glue job that periodically processes batch data and loads it into a Delta Lake table on S3 (a merge operation, roughly as sketched below). My settings: Spark 3.1, Python 3, Glue 3.0, worker type G.1X, number of workers 5.
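For context, the merge step looks roughly like this (the bucket, paths, and join key are simplified placeholders, not my real schema):

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # Glue supplies the session

    # Placeholder paths and join key
    target = DeltaTable.forPath(spark, "s3://my-bucket/delta/events/")
    batch = spark.read.parquet("s3://my-bucket/incoming/latest-batch/")

    (target.alias("t")
        .merge(batch.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())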
Everything worked well at first, with an average processing time of ~25 minutes. But recently I noticed the job had been running for 17 hours and was still in the RUNNING state.
After forcibly stopping the job and checking the error logs, I found 4 identical errors: "Lost executor # on ip address: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages."
It became clear that all my executors had died, yet for some reason the AWS Glue job kept running, which unpleasantly surprised me. Yes, AWS Glue has a job timeout option that sets the maximum execution time, but it defaults to 2880 minutes (48 hours), and I never expected that after a fatal error my job would keep running until it hit that limit.
The error itself is also vague and doesn't point to anything in particular. While looking for the cause, I formulated several hypotheses, the main one being out of memory. I couldn't verify this right away because my monitoring options were disabled, so I decided to reproduce the error with monitoring enabled (job metrics and continuous logging) and the timeout limit set to 60 minutes, as sketched below.
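This is roughly how I started the run with monitoring enabled and the shorter timeout; the job name is a placeholder, while --enable-metrics and --enable-continuous-cloudwatch-log are Glue's standard special job parameters:

    import boto3

    glue = boto3.client("glue")

    # "my-delta-merge-job" is a placeholder name
    glue.start_job_run(
        JobName="my-delta-merge-job",
        Timeout=60,  # minutes; overrides the 2880-minute default for this run
        Arguments={
            "--enable-metrics": "true",                   # job metrics in CloudWatch
            "--enable-continuous-cloudwatch-log": "true", # continuous logging
        },
    )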
First try, G.1X with 5 workers:

[screenshot: job metrics]

Within 10 minutes all my executors died, while average memory usage never exceeded 70%. The Glue job nevertheless kept running and stopped only at the 60-minute timeout.
The next day I made a second attempt, this time with G.2X and 5 workers:

[screenshot: job metrics]

No luck.
On the third attempt I moved to G.2X with 10 workers and the job finally completed, though only one executor remained active by the end. This was very confusing, because the total size of the batch data is only about 5 GB.
As a final experiment, I ran a similar script on an EC2 instance with 16 GB of RAM, on Spark in local mode with custom configurations, and it completed without any problems! This suggests the problem is not a lack of compute resources.
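The local-mode setup was approximately the following; the exact version pins are from memory, but the delta-core 1.0.x line is the one compatible with Spark 3.1:

    from pyspark.sql import SparkSession

    # Approximate local-mode configuration used on the EC2 instance
    spark = (
        SparkSession.builder
        .master("local[*]")
        .config("spark.driver.memory", "12g")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )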

The error message says only that some threshold was probably exceeded, but not which one.


Thanks in advance for any advice.


2 Answers


Solved this issue by setting the Spark configurations spark.driver.memory=12g and spark.yarn.executor.memoryOverhead=2024.

I'm confused as to why these configs aren't managed automatically by AWS Glue itself.
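In Glue these can be passed through the --conf job parameter (officially internal to Glue, but chaining "--conf key=value" pairs in its value is a common workaround for settings Glue doesn't expose); a sketch with boto3, where the job name is a placeholder:

    import boto3

    glue = boto3.client("glue")

    # "my-delta-merge-job" is a placeholder name
    glue.start_job_run(
        JobName="my-delta-merge-job",
        Arguments={
            "--conf": "spark.driver.memory=12g"
                      " --conf spark.yarn.executor.memoryOverhead=2024",
        },
    )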

Comment: Hello, I have a similar issue, can you please suggest? https://stackoverflow.com/questions/75199778/aws-glue-executorlostfailure-executor-15-exited-caused-by-one-of-the-running-ta – Vijeth Kashyap Jan 22 '23 at 11:13
0

UPD: the memory configurations still did not help; the cluster periodically crashes even though there are enough resources. I suspect this comes down to the Glue cluster's settings and compatibility with Delta Lake versions; for example, there are confirmed errors between some EMR releases and Delta (error link). I assume Glue is a wrapper over EMR. I switched to EMR directly with new settings (ReleaseLabel=emr-5.30.0, Spark=2.4.5, io.delta:delta-core_2.11:0.6.1), and now I don't have any problems at all.
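Roughly how the EMR setup looks with boto3; cluster name, instance types, and S3 paths are placeholders, while the ReleaseLabel and delta-core coordinates are the ones that worked for me:

    import boto3

    emr = boto3.client("emr")

    emr.run_job_flow(
        Name="delta-merge-cluster",          # placeholder name
        ReleaseLabel="emr-5.30.0",           # ships Spark 2.4.5
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "delta-merge",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--packages", "io.delta:delta-core_2.11:0.6.1",
                    "s3://my-bucket/scripts/merge_job.py",  # placeholder path
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )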
