I have to run a Spark job (I am new to Spark) and I am getting the following error:
[2022-02-16 14:47:45,415] {{bash.py:135}} INFO - Tmp dir root location: /tmp
[2022-02-16 14:47:45,416] {{bash.py:158}} INFO - Running command: spark-submit --class org.xyz.practice.driver.PractitionerDriver s3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar
[2022-02-16 14:47:45,422] {{bash.py:169}} INFO - Output:
[2022-02-16 14:47:45,423] {{bash.py:173}} INFO - bash: spark-submit: command not found
[2022-02-16 14:47:45,423] {{bash.py:177}} INFO - Command exited with return code 127
[2022-02-16 14:47:45,437] {{taskinstance.py:1482}} ERROR - Task failed with exception
What needs to be done to fix this? Here is my code:
def run_spark(**kwargs):
    import logging
    import pyspark

    sc = pyspark.SparkContext()
    # textFile returns an RDD of lines, not a DataFrame
    lines = sc.textFile('s3://demoairflowpawan/people.txt')
    logging.info('Number of lines in people.txt = {0}'.format(lines.count()))
    sc.stop()
spark_task = BashOperator(
    task_id='spark_java',
    bash_command='spark-submit --class {{ params.class }} {{ params.jar }}',
    params={
        'class': 'org.xyz.practice.driver.PractitionerDriver',
        'jar': 's3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar',
    },
    dag=dag,
)
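
For context, return code 127 is what bash reports when it cannot find a command on the PATH, so the failure in the log happens before Spark itself starts. Below is a minimal sketch for checking whether `spark-submit` is visible from the environment the task runs in. The helper name `find_executable` is hypothetical (it is not part of Airflow or Spark), and the check only looks at the PATH of the Python process that runs it.

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, else None.

    Hypothetical helper for illustration; it wraps the standard-library
    shutil.which lookup.
    """
    return shutil.which(name)


if __name__ == "__main__":
    location = find_executable("spark-submit")
    if location is None:
        # Same situation as in the log: bash would exit with code 127
        print("spark-submit is not on PATH")
    else:
        print("spark-submit found at", location)
```

Running this inside a PythonOperator on the same worker would show whether the BashOperator can be expected to find the command at all.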