
After executing a spark-submit command in Kubernetes in cluster mode (--deploy-mode cluster), it always gives exit code 0 (success), even when the driver pod has failed. Ideally, the main pod should fail as well (i.e. go to state 'Error') if the application fails.

However, this issue does not occur in client deploy mode. In client mode, no separate driver pod is spawned and the application is executed in the main pod itself. As a result, the main pod exits with the actual exit code, so if the application fails, the main pod fails and goes to state 'Error'.
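For comparison, the same submit in client mode (only the deploy-mode flag differs; the remaining options match the cluster-mode command further below and are elided here) returns the script's real exit code, per the behaviour described above:

    /opt/spark/bin/spark-submit \
      --master k8s://https://kubernetes.default.svc \
      --deploy-mode client \
      ... \
      local:///opt/application/ingr.py --sleepInSec {{workflow.parameters.sleepInSec}} --count {{workflow.parameters.count}}
    echo $?   # non-zero (the script's failure code) when ingr.py fails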

In cluster mode, the observation is as follows when I executed a sample workflow to test this behaviour:

  • a pod comes up with name <workflowname>-<uniqueId> (say htest-3668387602); this is the main pod
  • another pod is spawned with name <sparkAppName>-<uniqueId>-driver (e.g. hgigTest-7b241f8-driver); the application script is executed in this pod
  • let's say the application script fails and exits with exit code 12
  • the driver pod hgigTest-7b241f8-driver goes to state Error (this is expected, since the application script exits with a non-success code)
  • however, the main pod htest-3668387602 finishes with state Completed (i.e. a success state)
  • upon checking the exit code of the main pod, it shows 0 (i.e. success), whereas it should be 12 (same as that of the driver pod); see the kubectl check after this list
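For reference, this is roughly how I check the two exit codes (namespace and pod names are the ones from the example above; both pods are assumed to have a single container):

    # exit code of the main (spark-submit) pod's container -> prints 0
    kubectl -n racenv-dr-pps get pod htest-3668387602 \
      -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'

    # exit code of the driver pod's container -> prints 12
    kubectl -n racenv-dr-pps get pod hgigTest-7b241f8-driver \
      -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'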

Implications:

  • Due to this issue, one cannot deduce from the main pod whether the process was actually successful.
  • In an Argo workflow, the workflow is not stopped (i.e. further steps are still executed) even when the spark-submit command fails (in cluster mode, as described above).

Following is the relevant part of the workflow. Here, in the args attribute, I have appended further commands to print the exit code.

# Sample K8s workflow (relevant part)
....

      - name: task_template_1
        inputs:
          parameters:
            - name: abcd
        container:
            env:
                - name: EnvVar1
                  value: "1234"
            image: 9919222323.dkr.ecr.us-east-1.amazonaws.com/gitlab/asdfg/image1:{{workflow.parameters.imageVersion}}
            volumeMounts:
                - mountPath: /home/app/.aws
                  name: aws-creds
                  readOnly: true
            command: [sh, -c]
            args: [

              " (/opt/spark/bin/spark-submit \
                --master k8s://https://kubernetes.default.svc \
                --deploy-mode cluster \
                --conf spark.kubernetes.driverEnv.HTTP2_DISABLE=true \
                --conf spark.executorEnv.HTTP2_DISABLE=true \
                --conf spark.kubernetes.driverEnv.KUBERNETES_TLS_VERSIONS='TLSv1.2,TLSv1.3' \
                --conf spark.executorEnv.KUBERNETES_TLS_VERSIONS='TLSv1.2,TLSv1.3' \
                --conf spark.kubernetes.namespace=racenv-dr-pps \
                --conf spark.kubernetes.container.image=9919222323.dkr.ecr.us-east-1.amazonaws.com/gitlab/asdfg/image1:{{workflow.parameters.imageVersion}} \
                --conf spark.jars.ivy=/tmp/.ivy \
                --conf spark.hadoop.fs.s3a.server-side-encryption.enabled=true \
                --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=SSE-KMS \
                --conf spark.hadoop.fs.s3a.server-side-encryption.key=alias/pod/racenv \
                --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
                --conf spark.kubernetes.driver.podTemplateFile=/opt/application/driver.yaml \
                --conf spark.kubernetes.executor.podTemplateFile=/opt/application/executor.yaml \
                --conf spark.driver.extraJavaOptions=-Dlog4jspark.root.logger=WARN,console \
                --conf spark.app.name=hgigTest \
                --conf spark.kubernetes.authenticate.driver.serviceAccountName=dr-spark-submit \
                --conf spark.executor.memory=10g \
                --conf spark.executor.instances=4 \
                --conf spark.driver.memory=10g \
                --conf spark.executor.cores=8 \
                --conf spark.driver.cores=8 \
                --conf spark.kubernetes.executor.limit.cores=4 \
                --conf spark.kubernetes.executor.request.cores=3 \
                --conf spark.kubernetes.driver.request.cores=3 \
                --conf spark.kubernetes.driver.limit.cores=4 \
                --conf spark.kubernetes.executor.limit.memory=6g \
                --conf spark.kubernetes.executor.request.memory=4g \
                --conf spark.kubernetes.driver.limit.memory=6g \
                --conf spark.kubernetes.driver.request.memory=4g \
                --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
                --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
                --jars local:///opt/spark/jars/aws-java-sdk-bundle-1.11.271.jar,local:///opt/spark/jars/hadoop-aws-3.2.0.jar,local:///opt/spark/jars/delta-core_2.12-1.0.0.jar \
                local:///opt/application/ingr.py  --sleepInSec {{workflow.parameters.sleepInSec}} --count {{workflow.parameters.count}}); \
                exit_code_p1=$? ; \
                echo \"exit_code_p1 is ${exit_code_p1} \"; \
                exit $exit_code_p1
                "
                ]

...

Sample end logs of the main pod htest-3668387602:

.......
22/08/03 08:30:35 INFO LoggingPodStatusWatcherImpl: Application status for spark-fad64a94a39bdb0 (phase: Failed)
22/08/03 08:30:35 INFO LoggingPodStatusWatcherImpl: Container final statuses:


     container name: hing
     container image: 9919222323.dkr.ecr.us-east-1.amazonaws.com/gitlab/asdfg/image1:testcode-00
     container state: terminated
     container started at: 2022-08-03T08:28:47Z
     container finished at: 2022-08-03T08:30:28Z
     exit code: 12
     termination reason: Error
22/08/03 08:30:35 INFO LoggingPodStatusWatcherImpl: Application hingTest with submission ID racenv-dr-pps:hgigTest-7b241f8-driver finished
22/08/03 08:30:35 INFO ShutdownHookManager: Shutdown hook called
22/08/03 08:30:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-1c1ef2b1-9871-1bec-91f8-5a8662
exit_code_p1 is 0

Note that the exit code printed is 0 (success), whereas the log shows that the driver pod failed with exit code 12 (error).
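The only workaround I can think of is to capture the spark-submit output and fail the step explicitly when the status watcher reports a failed phase, roughly as sketched below (the "phase: Failed" string is taken from the log above and may differ across Spark versions), but I would prefer a proper fix:

    # run the same spark-submit as above, keeping its output for inspection
    /opt/spark/bin/spark-submit ... local:///opt/application/ingr.py ... > /tmp/submit.log 2>&1
    exit_code_p1=$?
    cat /tmp/submit.log
    # spark-submit itself exits 0 in cluster mode, so also check the reported phase
    if grep -q "phase: Failed" /tmp/submit.log; then
        echo "driver reported a Failed phase, forcing a non-zero exit"
        exit 1
    fi
    exit $exit_code_p1

Inside the args string of the workflow above, the same commands would of course need the usual quoting and escaping.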

How to make sure that if application fails in cluster mode, then main application pod should also fail i.e. if the driver pod fails then main pod should fail too ?
