
I have to run a Spark job (I am new to Spark) and am getting the following error:

[2022-02-16 14:47:45,415] {{bash.py:135}} INFO - Tmp dir root location: /tmp

[2022-02-16 14:47:45,416] {{bash.py:158}} INFO - Running command: spark-submit --class org.xyz.practice.driver.PractitionerDriver s3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar

[2022-02-16 14:47:45,422] {{bash.py:169}} INFO - Output:

[2022-02-16 14:47:45,423] {{bash.py:173}} INFO - bash: spark-submit: command not found

[2022-02-16 14:47:45,423] {{bash.py:177}} INFO - Command exited with return code 127

[2022-02-16 14:47:45,437] {{taskinstance.py:1482}} ERROR - Task failed with exception

What has to be done to fix this? My code is below:

def run_spark(**kwargs):
    import logging
    import pyspark

    # Start a SparkContext, count the lines of the S3 text file, log the count, then stop
    sc = pyspark.SparkContext()
    df = sc.textFile('s3://demoairflowpawan/people.txt')  # textFile returns an RDD
    logging.info('Number of lines in people.txt = {0}'.format(df.count()))
    sc.stop()
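
If the Python version is used instead of spark-submit, it would typically be wired into the DAG with a PythonOperator, roughly like this (a sketch, assuming Airflow 2.x; the task_id is arbitrary):

from airflow.operators.python import PythonOperator

spark_python_task = PythonOperator(
    task_id='spark_python',       # assumption: any unique task id
    python_callable=run_spark,    # the function defined above
    dag=dag,
)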

spark_task = BashOperator(
    task_id='spark_java',
    bash_command='spark-submit --class {{ params.class }} {{ params.jar }}',
    params={'class': 'org.xyz.practice.driver.PractitionerDriver', 'jar': 's3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar'},
    dag=dag
)
JaySean

1 Answer


The question is: why do you expect spark-submit to be there? If you created the default Airflow pods, they come with the Airflow code only.

You can check an example of Spark with Airflow here - https://medium.com/codex/executing-spark-jobs-with-apache-airflow-3596717bbbe3 - where it is stated specifically that "Spark binaries must be added and mapped".

So you need to figure out how to get the Spark binaries onto the existing Airflow pod.
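
Once spark-submit is actually available on the Airflow worker, you could also invoke it through the SparkSubmitOperator from the apache-airflow-providers-apache-spark package instead of a raw BashOperator. A minimal sketch; the 'spark_default' connection is an assumption and has to be configured in Airflow first:

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

spark_task = SparkSubmitOperator(
    task_id='spark_java',
    application='s3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar',
    java_class='org.xyz.practice.driver.PractitionerDriver',
    conn_id='spark_default',   # assumption: points at your Spark master/cluster
    dag=dag,
)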

Alternatively, you can create another Kubernetes job that does the spark-submit, and have your DAG trigger that job.
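
A rough sketch of that alternative with the KubernetesPodOperator (assuming the cncf.kubernetes provider is installed; the import path, image and namespace below are assumptions and depend on your provider version and cluster):

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

spark_submit_pod = KubernetesPodOperator(
    task_id='spark_java_k8s',
    name='spark-submit-pod',
    namespace='default',                 # assumption: adjust to your cluster
    image='bitnami/spark:3.2.1',         # assumption: any image that ships spark-submit
    cmds=['spark-submit'],
    arguments=[
        '--class', 'org.xyz.practice.driver.PractitionerDriver',
        's3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar',
    ],
    get_logs=True,
    dag=dag,
)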

Sorry for the high-level answer...

Doron Veeder