Spark has a built-in JDBC format reader you can use.
Launch a Scala shell to confirm your MS SQL Server JDBC driver is on the classpath
example
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it on the classpath where you initialize your session (the jar name below is a placeholder; use whatever version of the MS SQL JDBC driver jar you downloaded)
example
bin/spark-shell --driver-class-path mssql-jdbc-<version>.jar --jars mssql-jdbc-<version>.jar
Connect to your MS SQL Server via Spark JDBC
example via pyspark
# option1 -- generic format("jdbc") reader; host/port/database are placeholders
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<database>") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
# option2 -- shorthand jdbc() reader with connection properties
jdbcDF2 = spark.read \
    .jdbc("jdbc:sqlserver://<host>:1433;databaseName=<database>", "schema.tablename",
          properties={"user": "username", "password": "password",
                      "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"})
Specifics and additional ways to build connection strings can be found here
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You mentioned Jupyter ... if you still cannot get the above to work, try setting some env vars per this post (I cannot confirm whether this works though)
https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6
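The usual trick in that kind of setup is exporting PYSPARK_SUBMIT_ARGS before the SparkSession is created; a minimal sketch, assuming the jar path below is replaced with wherever your driver jar actually lives:

import os

# assumption: /path/to/mssql-jdbc.jar is a placeholder for your driver jar;
# must be set BEFORE the SparkSession is created, and the trailing
# "pyspark-shell" token is required by PySpark
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path /path/to/mssql-jdbc.jar "
    "--jars /path/to/mssql-jdbc.jar pyspark-shell"
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("jdbc-test").getOrCreate()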
At the end of the day all you really need is the driver jar placed on an edge node (the client where you launch Spark) and appended to your classpath. Then make the connection and partition the read to scale performance, since a plain JDBC read from an RDBMS runs as a single thread, hence 1 partition. Spark can parallelize the read for you if you supply partitionColumn, lowerBound, upperBound, and numPartitions (see the sketch below).
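A minimal sketch of a partitioned read, assuming schema.tablename has a numeric id column; the bounds and partition count are made-up values you'd tune for your table:

# same option1-style read, but split into 8 parallel partitions on id;
# Spark turns the bounds into non-overlapping WHERE clauses, one query per partition
jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<database>") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .load()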