Question 1: I have this piece of code which works well when run in Spark deploy mode CLIENT, but it throws an exception when I run the same code in cluster mode. I went through most of the SO questions on this topic, but didn't find a solution.

I'm using Python 3.7 and Spark 2.4.

My guess: ODBC Driver 13 for SQL Server is available to the driver when running in client mode, but the same driver is unavailable to the executors at run time. If my guess is correct, how do I make the driver available to the executors?
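
If the guess is right, I'd expect pyodbc.drivers() to report the driver name on the node where it is installed and not on the others. A minimal sketch of how I could check this from inside a Spark job (assuming pyodbc is importable on the executors; this is not part of my application code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def list_odbc_drivers(_):
    # Imported here so it runs on the executor, not the driver
    import pyodbc
    # pyodbc.drivers() lists the driver names unixODBC has registered on this host
    yield pyodbc.drivers()

# One partition per default-parallelism slot, so the check lands on the executors
print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
        .mapPartitions(list_odbc_drivers)
        .collect())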

Code:

import pyodbc

def get_sql_server_connection():
    try:
        user_id = "app_server"
        hostname = "hdp-host1"
        conn_driver = "ODBC Driver 13 for SQL Server"
        database = "sales"
        password = "pass1"
        
        # Build the connection string first, then pass it to pyodbc.connect
        conn = pyodbc.connect(
            'DRIVER={%s};SERVER=%s;DATABASE=%s;UID=%s;PWD=%s' % (
                conn_driver, hostname, database, user_id, password
            )
        )
        return conn
    except Exception as e:
        print("Exception occured while establishing connection to SQL SERVER. stacktrace: \n{}".format(e))

Error:

pyodbc.Error: ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'ODBC Driver 13 for SQL Server' : file not found (0) (SQLDriverConnect)")

Question 2: How can I find the path where pyodbc/unixODBC searches for this driver?
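
My understanding (which may be wrong) is that unixODBC resolves the driver name through odbcinst.ini. A sketch of the diagnostic I'd run on a node to see which files are consulted and which names are registered, assuming the odbcinst tool from unixODBC is on the PATH:

import subprocess
import pyodbc

# Driver names unixODBC has registered on this host
print(pyodbc.drivers())

# Locations of the odbcinst.ini / odbc.ini files unixODBC reads
print(subprocess.run(["odbcinst", "-j"], capture_output=True, text=True).stdout)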

Question 3: If it is possible, how do I pass the driver explicitly along with spark-submit?

Rohit Nimmala