
We are running DStream applications on a Kubernetes cluster using the Spark Operator (Spark 2.4.7). Sometimes, for various reasons (OOMs, Kubernetes node restarts), executor pods get lost. Often Spark notices this and schedules a replacement executor, but eventually (after a week or more) most of the applications reach a state where lost executors are no longer rescheduled, and the application keeps running with fewer executors than requested. In the Spark UI those "forever lost" executors are shown as healthy, but they are obviously not fetching any data from Kafka. The only way to get the application working as expected again is to recreate the SparkApplication CRD, which is basically a hard restart (see the sketch below).
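For reference, the hard restart looks roughly like this; the application name, namespace, and manifest file are hypothetical:

# delete the SparkApplication custom resource (operator tears down driver and executors)
kubectl delete sparkapplication my-dstream-app -n spark-jobs
# re-apply the manifest so the operator submits the application from scratch
kubectl apply -f my-dstream-app.yaml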

You can find the restartPolicy section of the SparkApplication CRD below:

restartPolicy:
  onFailureRetries: 100
  onFailureRetryInterval: 20            # seconds
  onSubmissionFailureRetries: 5
  onSubmissionFailureRetryInterval: 30  # seconds
  type: Always
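As far as I understand, this policy only applies when the driver terminates or the submission fails, so it never kicks in here: the driver keeps running, just with fewer executors than requested.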
