My Spark and Airflow servers are separate, and the Airflow servers do not have the Spark binaries installed. I can already use SSHOperator to run Spark jobs in cluster mode without problems. For submitting PySpark jobs, which is the better choice in the long run: SSHOperator or SparkSubmitOperator? Any help would be appreciated.
2 Answers
Below are the pros and cons of SSHOperator vs. SparkSubmitOperator in Airflow, followed by my recommendations.
SSHOperator: This operator opens an SSH connection to the remote Spark server and runs spark-submit there.
Pros:
- No additional configuration required on the Airflow workers
Cons:
- The Spark configuration parameters are hard to maintain, because they end up embedded in a shell command string
- SSH (port 22) must be open from the Airflow servers to the Spark servers, which raises security concerns (even on a private network, SSH-based remote execution is not a best practice)
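If you stay with SSHOperator, one way to ease the maintenance problem is to build the spark-submit command from structured data instead of hand-editing one long string. The following is a minimal sketch, not a definitive implementation: the helper name `build_spark_submit_command`, the script path, and the `yarn` master are assumptions made for illustration.

```python
import shlex


def build_spark_submit_command(application, conf=None, app_args=None,
                               master="yarn", deploy_mode="cluster"):
    """Return a shell-quoted spark-submit command line as one string.

    The defaults (yarn, cluster mode) are assumptions; adjust them to
    match your cluster.
    """
    parts = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Sort the keys so the generated command is stable between runs.
    for key, value in sorted((conf or {}).items()):
        parts += ["--conf", f"{key}={value}"]
    parts.append(application)
    parts += [str(arg) for arg in (app_args or [])]
    # shlex.join quotes every argument for safe use in a POSIX shell.
    return shlex.join(parts)


command = build_spark_submit_command(
    "/opt/jobs/etl.py",  # hypothetical path on the Spark server
    conf={"spark.executor.memory": "4g"},
    app_args=["--date", "2024-01-01"],
)

# In a DAG, this string would be passed as the `command` argument of
# airflow.providers.ssh.operators.ssh.SSHOperator, for example:
#   SSHOperator(task_id="submit_job", ssh_conn_id="spark_ssh", command=command)
# ("spark_ssh" is a hypothetical connection ID.)
```

Keeping the configuration in a dictionary means it can be reviewed, diffed, and shared between tasks, while the quoting is handled in one place.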
SparkSubmitOperator: This operator runs spark-submit in a cleaner way, but it requires additional infrastructure configuration.
Pros:
- As mentioned above, it exposes the Spark configuration as operator parameters, so invoking spark-submit takes no extra effort
Cons:
- Spark needs to be installed on all Airflow servers so that the spark-submit binary is available there
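To illustrate what "handy Spark configuration" looks like, here is a hedged sketch of a SparkSubmitOperator task. It assumes the `apache-airflow-providers-apache-spark` package and a spark-submit binary are available on the Airflow workers; the task ID, script path, and settings are hypothetical. The import is done inside the function only so that the configuration can be inspected on a machine without Airflow installed.

```python
# Keyword arguments for SparkSubmitOperator, kept as plain data.
# All values are illustrative assumptions.
SPARK_TASK_KWARGS = {
    "task_id": "submit_pyspark_job",
    "conn_id": "spark_default",         # Airflow connection holding the master URL
    "application": "/opt/jobs/etl.py",  # hypothetical PySpark script
    "name": "etl",
    "conf": {"spark.sql.shuffle.partitions": "200"},
    "executor_memory": "4g",
    "num_executors": 2,
    "application_args": ["--date", "2024-01-01"],
}


def make_spark_task(dag):
    """Create the task; requires the Spark provider package at call time."""
    from airflow.providers.apache.spark.operators.spark_submit import (
        SparkSubmitOperator,
    )

    return SparkSubmitOperator(dag=dag, **SPARK_TASK_KWARGS)
```

Compared with the SSH approach, each setting is a named parameter rather than a fragment of a shell string.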
Apart from these two options, here are two more:
- Install a Livy server on the Spark cluster and use the Python Livy library to interact with Spark from Airflow. Refer: https://pylivy.readthedocs.io/en/stable/
- If your Spark clusters are on AWS EMR, I would encourage you to use EmrAddStepsOperator.
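For the Livy option, the sketch below shows how a PySpark script could be submitted as a Livy batch. Note that it calls Livy's REST endpoint (`POST /batches`) directly with the standard library rather than going through pylivy. The Livy URL and the script location are hypothetical, and the script must be readable by the cluster (for example on HDFS).

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, file, args=None, conf=None):
    """Build (but do not send) a POST /batches request for a Livy server."""
    payload = {
        "file": file,                # script location visible to the cluster
        "args": list(args or []),    # arguments passed to the script
        "conf": dict(conf or {}),    # Spark configuration properties
    }
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(request):
    """Send the request and return Livy's JSON response as a dict."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_livy_batch_request(
    "http://livy.example.internal:8998",  # hypothetical Livy endpoint
    "hdfs:///jobs/etl.py",                # hypothetical script location
    args=["--date", "2024-01-01"],
    conf={"spark.executor.memory": "4g"},
)
# Calling submit_batch(request) from an Airflow PythonOperator would start the job.
```

Only HTTP access to the Livy port is needed from Airflow, so neither SSH access nor a local Spark installation is required.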
For additional discussion, refer to the related question: "To run Spark Submit programs from a different cluster (1**.1*.0.21) in airflow (1**.1*.0.35). How to connect remotely other cluster in airflow"
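For the EMR option mentioned above, a step definition for EmrAddStepsOperator might look like the following sketch. The step uses the format expected by the EMR API, with `command-runner.jar` invoking spark-submit on the cluster. The S3 path, step name, and task ID are hypothetical, and the import path shown is the one used by recent versions of the Amazon provider package, so it may differ in older installations.

```python
def build_pyspark_step(name, script_uri, script_args=None):
    """Return one EMR step that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                script_uri, *(script_args or []),
            ],
        },
    }


STEPS = [
    build_pyspark_step(
        "run_etl",
        "s3://my-bucket/jobs/etl.py",  # hypothetical script location
        ["--date", "2024-01-01"],
    )
]


def make_add_steps_task(dag, job_flow_id):
    """Create the task; requires the Amazon provider package at call time."""
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

    return EmrAddStepsOperator(
        task_id="add_pyspark_step",
        job_flow_id=job_flow_id,       # ID of the running EMR cluster
        aws_conn_id="aws_default",
        steps=STEPS,
        dag=dag,
    )
```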

SparkSubmitOperator is a specialized operator. That is, it should make writing tasks for submitting Spark jobs easier and the code itself more readable and maintainable. Therefore, I would use it if possible.
In your case, you should consider whether the effort of modifying the infrastructure so that you can use SparkSubmitOperator is worth the benefits mentioned above.
