
I have been trying to spark-submit programs from Airflow, but the Spark files are on a different cluster (1**.1*.0.21) while Airflow is on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any XML files or other files to my Airflow cluster.

When I try an SSH hook it says the following. I also have many doubts about using the SSHOperator and BashOperator.

Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko
    Does this answer your question? [ImportError: No module named 'paramiko'](https://stackoverflow.com/questions/28173520/importerror-no-module-named-paramiko) – Oleksandr Lykhonosov Jan 01 '20 at 12:45
  • What XML files are you referring to? Hadoop uses XML. All Spark apps would use are spark-env.sh, spark-defaults.conf, and hive-site.xml. These should be bundled with each executor and distributed upon submit... – OneCricketeer Jan 02 '20 at 06:17

3 Answers


I got the connection working; here is my code and procedure.

import airflow
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


# The spark-submit command is kept in params so the templated SSH command
# below can pick it up via Jinja.
dag = DAG(dag_id="spk", description='filer',
          schedule_interval='* * * * *',
          start_date=airflow.utils.dates.days_ago(2),
          params={'project_source': '/home/afzal',
                  'spark_submit': '/usr/hdp/current/spark2-client/bin/spark-submit --principal hdfs-ivory@KDCAUTH.COM --keytab /etc/security/keytabs/hdfs.headless.keytab --master yarn --deploy-mode client airpy.py'})

# cd into the project directory on the remote Spark cluster, then run spark-submit there
templated_bash_command = """
            cd {{ params.project_source }}
            {{ params.spark_submit }}
            """

t1 = SSHOperator(
       task_id="SSH_task",
       ssh_conn_id='spark_21',
       command=templated_bash_command,
       dag=dag
       )

and I also created a connection under 'Admin > Connections' in Airflow:

Conn Id : spark_21
Conn Type : SSH
Host : mas****p
Username : afzal
Password : ***** 
Port  :
Extra  :

The username and password are used to log in to the desired cluster.
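
If you prefer not to set this up through the UI, the same connection can also be created from Python. This is only a sketch of my own (not part of the original answer), assuming Airflow 1.10's airflow.models.Connection API; the host, username, and password below are the same placeholder values as above.

# Sketch: create the 'spark_21' SSH connection programmatically instead of via the UI.
# Host/login/password are placeholders; replace them with your real values.
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id='spark_21',
    conn_type='ssh',
    host='mas****p',      # Spark cluster host
    login='afzal',        # SSH username
    password='*****',     # SSH password
    port=22,              # default SSH port
)

session = settings.Session()
session.add(conn)
session.commit()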


You can try using Livy. In the following Python example, my executable jars are on S3.

import json, requests

def spark_submit(master_dns):
    # Livy's REST endpoint listens on port 8998 of the master node
    host = 'http://' + master_dns + ':8998'
    data = {"conf": {"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"},
            'file': "s3://<your driver jar>",
            "jars": ["s3://<dependency>.jar"]}
    headers = {'Content-Type': 'application/json'}
    print("Calling request........")
    # Submit the batch job to Livy's /batches endpoint
    response = requests.post(host + '/batches', data=json.dumps(data), headers=headers)
    print(response.json())
    return response.headers

I am running the above code wrapped as a PythonOperator from Airflow.
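
For reference, a minimal sketch of how that wrapping could look (my own illustration, not part of the original answer), assuming Airflow 1.10's PythonOperator and a placeholder master hostname:

# Sketch: wrap the spark_submit() helper above in a PythonOperator.
# 'your-emr-master-dns' is a placeholder for the Livy/EMR master hostname.
from airflow.operators.python_operator import PythonOperator

submit_task = PythonOperator(
    task_id='livy_spark_submit',
    python_callable=spark_submit,
    op_kwargs={'master_dns': 'your-emr-master-dns'},
    dag=dag,
)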


paramiko is a Python library for performing SSH operations. You have to install paramiko to use the SSHOperator. Simply install it with the command: pip3 install paramiko

Let me know if you have any problems after installing paramiko.
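
As a quick sanity check (a sketch of my own, not part of the original answer), you can verify that paramiko can reach the Spark cluster before wiring it into an Airflow DAG; the hostname, username, and password below are placeholders:

# Sketch: confirm paramiko can open an SSH session to the Spark cluster
# and run a command there. Replace the placeholder credentials.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('mas****p', username='afzal', password='*****')

stdin, stdout, stderr = client.exec_command('spark-submit --version')
print(stdout.read().decode())
client.close()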

  • I installed paramiko, but I am having problems with my bash command with spark-submit and with how to establish a connection to the Spark cluster in Airflow. Please help with this. Thank you – Afzal Abdul Azeez Jan 03 '20 at 09:16
  • Are you able to SSH to the Spark cluster from your Airflow cluster without an Airflow DAG? Try "ssh spark_cluster_hostname"; if it works and you are logged in to your Spark cluster from the Airflow cluster, then it's fine, but if not, you have to add the public key of the Airflow cluster to the Spark cluster's authorized_keys file. Then you can SSH into the Spark cluster. Please update me after this, and let me know one more thing: does the Spark cluster have Kerberos security enabled? – Aakash Damle Jan 03 '20 at 09:54
  • The main error was in Admin > Connections: I didn't give the required username and password used to log in to the desired cluster. – Afzal Abdul Azeez Jan 09 '20 at 07:16