1

When I load a MySQL JDBC driver by first copying it to the driver, and then including it via --jars /path/to/jdbc/driver.jar, then referencing that jdbc driver and loading data into a dataframe succeeds.

$ pyspark --jars /path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})

But, if I load the jar over the publicly available https-hosted version of that exact jar file, it fails.

$ pyspark --jars https://s3/path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})

py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
...

According to the docs, you can submit jars from various locations, from local to http/https, etc. Why would this cause a different behavior?

Update: I also tried running two spark-submit jobs, one with each variant of the jars path to the jdbc jar. The https jar submission threw the same error as above.

Kristian
  • 21,204
  • 19
  • 101
  • 176

0 Answers0