
How can I modify the code below to install a PostgreSQL ODBC driver instead of the MS SQL one? My goal is to use pyodbc to connect to a Redshift database from Azure Databricks. I thought the PostgreSQL ODBC driver was already installed in my Databricks runtime by default, but when I run pyodbc.drivers() I get just "['ODBC Driver 17 for SQL Server']", so I guess not. And what would the pyodbc.connect() string look like once the PostgreSQL driver is installed? If it's easier, we can instead use Amazon's recommended ODBC driver for Redshift.

# Add the Microsoft package repository key and source list
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
# Install the MS SQL ODBC driver (this is the part I want to replace)
ACCEPT_EULA=Y apt-get install msodbcsql17
# unixODBC development headers required by pyodbc
apt-get -y install unixodbc-dev
sudo apt-get install python3-pip -y
pip3 install --upgrade pyodbc

1 Answer


You need to change your approach and use the correct instructions to install and configure the Redshift ODBC driver:

  1. install unixodbc-dev via apt-get
  2. download the 64-bit Debian package (the actual link is in the documentation)
  3. modify odbcinst.ini with the following information (you can store that file on DBFS and copy it from there, or copy /opt/amazon/redshiftodbc/lib/64/amazon.redshiftodbc.ini into odbcinst.ini):
[ODBC Drivers]
Amazon Redshift (x64)=Installed

[Amazon Redshift (x64)]
Description=Amazon Redshift ODBC Driver (64-bit)
Driver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
  4. install pyodbc via pip (a sketch of the resulting connect call follows below)
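
Once the driver is registered, pyodbc.connect() references it by the name declared in odbcinst.ini. A minimal sketch, assuming a placeholder cluster endpoint, database, and credentials (substitute your own):

import pyodbc

# "Amazon Redshift (x64)" must match the driver name registered in odbcinst.ini;
# the server, database, user, and password below are placeholders.
conn = pyodbc.connect(
    "Driver={Amazon Redshift (x64)};"
    "Server=examplecluster.abc123xyz.us-west-2.redshift.amazonaws.com;"
    "Port=5439;"
    "Database=dev;"
    "UID=awsuser;"
    "PWD=your_password"
)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchone())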

All these steps are best implemented as an init script that can be attached to the cluster.
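
A sketch of such an init script; the driver version and download URL below are illustrative only (take the actual link from the AWS documentation, as noted in step 2):

#!/bin/bash
set -e

# 1. unixODBC headers needed by pyodbc
apt-get update
apt-get -y install unixodbc-dev

# 2. download and install the 64-bit Redshift ODBC Debian package
#    (illustrative URL/version -- check the AWS docs for the current link)
curl -o /tmp/redshift-odbc.deb https://s3.amazonaws.com/redshift-downloads/drivers/odbc/1.4.34.1000/AmazonRedshiftODBC-64-bit-1.4.34.1000-1.x86_64.deb
dpkg -i /tmp/redshift-odbc.deb

# 3. register the driver with unixODBC
cat > /etc/odbcinst.ini <<'EOF'
[ODBC Drivers]
Amazon Redshift (x64)=Installed

[Amazon Redshift (x64)]
Description=Amazon Redshift ODBC Driver (64-bit)
Driver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
EOF

# 4. install pyodbc
pip3 install --upgrade pyodbc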

P.S. But do you really need to work through ODBC? Why not use the more scalable spark-redshift connector provided by Databricks?
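
For comparison, a hedged sketch of reading a table through that connector; the format name and option keys follow the Databricks documentation, while the JDBC URL, table name, and S3 temp directory are placeholders:

df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://examplecluster.abc123xyz.us-west-2.redshift.amazonaws.com:5439/dev?user=awsuser&password=your_password")
      .option("dbtable", "my_table")
      .option("tempdir", "s3a://my-bucket/tmp/")  # staging area for Redshift UNLOAD
      .option("forward_spark_s3_credentials", "true")
      .load())
df.show()

Unlike pyodbc, which runs on the driver node only, the connector unloads data to S3 and reads it in parallel across the cluster.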

– Alex Ott
  • These instructions are for Linux; I'm using a Windows environment. Could you modify the instructions? – Nate Sep 08 '21 at 14:36
  • Are you talking about connecting from the Azure Databricks environment, or from a local Windows machine? If the former, then it's Linux – Alex Ott Sep 08 '21 at 15:30
  • Azure Databricks, but when I download the Debian file I can't look inside it. I would think we have to download the Windows file to my Windows desktop and then upload that to Databricks. – Nate Sep 08 '21 at 15:59
  • Just put all the shell commands (curl etc.) into an init shell script, upload it to DBFS and add it as an init script to the cluster – Alex Ott Sep 08 '21 at 16:13
  • I'm not fully understanding. Do you mind spelling out exactly what I should put in a script to attach to the cluster? – Nate Sep 08 '21 at 16:43
  • main question: why do you need to use ODBC instead of the `spark-redshift` connector? ODBC will run only on one node, not parallelized – Alex Ott Sep 08 '21 at 17:10
  • We're using odbc for our other connections and want to keep it consistent. – Nate Sep 08 '21 at 17:24
  • it's not optimal for Spark... getting to your question - just put the `curl`, `apt-get`, etc. into the script. But you can debug all steps by doing things like `%sh curl ....` inside the Databricks notebook – Alex Ott Sep 08 '21 at 17:32
  • What curl statement? Do you mind writing out exactly what I should put? Thanks – Nate Sep 08 '21 at 17:33