
As part of a DAG, I am triggering a GCP PySpark Dataproc job using the code below:

DataProcPySparkOperator(
    dag=dag,
    gcp_conn_id=gcp_conn_id,
    region=region,
    main=pyspark_script_location_gcs,
    task_id='pyspark_job_1_submit',
    cluster_name=cluster_name,
    job_name="job_1"
)

How can I pass a variable as a parameter to the PySpark job so that it is accessible inside the script?

j '

1 Answer


You can use the `arguments` parameter of DataProcPySparkOperator:

arguments (list) – Arguments for the job. (templated)

job = DataProcPySparkOperator(
    gcp_conn_id=gcp_conn_id,
    region=region,
    main=pyspark_script_location_gcs,
    task_id='pyspark_job_1_submit',
    cluster_name=cluster_name,
    job_name="job_1",
    arguments=[
        "-arg1=arg1_value", # or just "arg1_value" for non named args
        "-arg2=arg2_value"
    ],
    dag=dag
)
blackbishop
  • I have a string variable to pass.. so what should the format be here? And how can I access it in the Spark script? – j ' Feb 10 '21 at 12:09
  • @j' If you have one variable to pass, just use `arguments=[string_var]` in the operator. To get the variable in the PySpark main job, you can use `sys.argv` or, better, the `argparse` package. You can see an example [here](https://stackoverflow.com/questions/32217160/can-i-add-arguments-to-python-code-when-i-submit-spark-job) of how to pass Python args. – blackbishop Feb 10 '21 at 12:14
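
Following up on the last comment, here is a minimal sketch of what the script referenced by `main=pyspark_script_location_gcs` could look like on its side, assuming the `-arg1=arg1_value` style used in the operator above. The file name and argument names are only illustrative.

# pyspark_job_1.py -- the script uploaded to GCS and referenced by `main` (name is illustrative)
import argparse
import sys

# Option 1: raw access -- sys.argv[1:] contains exactly the strings passed
# in the operator's `arguments` list, in the same order.
print(sys.argv[1:])  # e.g. ['-arg1=arg1_value', '-arg2=arg2_value']

# Option 2: argparse, matching the "-name=value" style used above.
parser = argparse.ArgumentParser()
parser.add_argument("-arg1")
parser.add_argument("-arg2")
args = parser.parse_args()

print(args.arg1)  # 'arg1_value'
print(args.arg2)  # 'arg2_value'

For the single string variable mentioned in the comment, `arguments=[string_var]` would simply show up as `sys.argv[1]` in the script.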