
I have Spark and Airflow on separate servers, and the Airflow servers do not have the Spark binaries installed. I am able to use the SSHOperator and run Spark jobs in cluster mode perfectly well. I would like to know which would be the better choice in the long run for submitting PySpark jobs: the SSHOperator or the SparkSubmitOperator. Any help would be appreciated.

kavya

2 Answers


Below are the pros and cons of using the SSHOperator vs. the SparkSubmitOperator in Airflow, followed by my recommendation.

SSHOperator: This operator SSHes into the remote Spark server and executes spark-submit on the remote cluster (a minimal sketch follows the cons below).

Pros:

  1. No additional configuration is required on the Airflow workers

Cons:

  1. The Spark configuration parameters are harder to maintain, since they live inside a shell command string
  2. SSH port 22 must be opened from the Airflow servers to the Spark servers, which raises security concerns (even on a private network, SSH-based remote execution is not a best practice)
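For reference, here is a minimal sketch of the SSH approach. It assumes an Airflow SSH connection with conn_id "spark_ssh" pointing at the Spark edge node and a job script at /opt/jobs/my_job.py on that node; both are placeholders for your own setup.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.ssh.operators.ssh import SSHOperator

    with DAG(
        dag_id="spark_job_via_ssh",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        submit_job = SSHOperator(
            task_id="spark_submit_over_ssh",
            ssh_conn_id="spark_ssh",  # placeholder SSH connection to the Spark edge node
            # The whole spark-submit invocation lives in one shell string, which is
            # exactly what makes the Spark configuration harder to maintain.
            command=(
                "spark-submit --master yarn --deploy-mode cluster "
                "--num-executors 4 --executor-memory 4g "
                "/opt/jobs/my_job.py"
            ),
        )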

SparkSubmitOperator: This operator performs the spark-submit operation in a cleaner way, but it requires additional infrastructure configuration (a minimal sketch follows the cons below).

Pros:

  1. As mentioned above, the Spark configuration is handled through operator arguments, and invoking spark-submit requires no additional effort

Cons:

  1. Spark needs to be installed on all Airflow servers/workers.
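For comparison, a minimal sketch of the SparkSubmitOperator approach. It assumes the apache-spark provider and the Spark client are installed on the Airflow workers, and that an Airflow Spark connection with conn_id "spark_default" points at the cluster (the master URL and deploy mode come from that connection); the application path and resource values are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="spark_job_via_spark_submit",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        submit_job = SparkSubmitOperator(
            task_id="spark_submit_task",
            conn_id="spark_default",            # master URL / deploy mode come from this connection
            application="/opt/jobs/my_job.py",  # placeholder path readable from the Airflow worker
            # Spark settings are plain operator arguments instead of a shell string,
            # so they are easier to review and maintain.
            num_executors=4,
            executor_memory="4g",
            verbose=False,
        )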

Apart from these two options, I have listed two more below.

  1. Install a Livy server on the Spark cluster and use the pylivy Python library to interact with the Spark servers from Airflow (sketched after this list). Refer to: https://pylivy.readthedocs.io/en/stable/

  2. If your Spark clusters are on AWS EMR, I would encourage using the EmrAddStepsOperator (also sketched below).
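A rough sketch of the Livy route, based on the pylivy documentation linked above; the Livy URL and the code run on the cluster are placeholders, and the exact pylivy API should be checked against the docs for your version. The function can then be called from a PythonOperator task, so no Spark binaries or SSH access are needed on the Airflow side, only HTTP access to the Livy port.

    from livy import LivySession  # pip install livy

    LIVY_URL = "http://spark-master:8998"  # placeholder Livy endpoint on the Spark cluster


    def run_spark_code():
        # pylivy talks to the Livy REST API; the code strings below are executed
        # as a PySpark session on the remote cluster.
        with LivySession.create(LIVY_URL) as session:
            session.run("df = spark.range(100)")
            session.run("print(df.count())")

And a rough sketch of the EMR route; the cluster id, S3 path, and step definition are placeholders, and the exact import path depends on your version of the Amazon provider package.

    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

    SPARK_STEP = [
        {
            "Name": "run_pyspark_job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/jobs/my_job.py",  # placeholder S3 path to the job
                ],
            },
        }
    ]

    add_step = EmrAddStepsOperator(
        task_id="submit_pyspark_step",
        job_flow_id="j-XXXXXXXXXXXX",  # placeholder EMR cluster id
        aws_conn_id="aws_default",
        steps=SPARK_STEP,
    )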

Refer here for additional discussion: To run Spark Submit programs from a different cluster (1**.1*.0.21) in airflow (1**.1*.0.35). How to connect remotely other cluster in airflow

Abdul

SparkSubmitOperator is a specialized operator. That is, it should make writing tasks for submitting Spark jobs easier and the code itself more readable and maintainable. Therefore, I would use it if possible.

In your case, you should consider whether the effort of modifying the infrastructure so that you can use the SparkSubmitOperator is worth the benefits mentioned above.

SergiyKolesnikov