When I load a MySQL JDBC driver by first copying it to the driver, and then including it via --jars /path/to/jdbc/driver.jar
, then referencing that jdbc driver and loading data into a dataframe succeeds.
$ pyspark --jars /path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})
But, if I load the jar over the publicly available https-hosted version of that exact jar file, it fails.
$ pyspark --jars https://s3/path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
...
According to the docs, you can submit jars from various locations, from local to http/https, etc. Why would this cause a different behavior?
Update: I also tried running two spark-submit
jobs, one with each variant of the jars path to the jdbc jar. The https jar submission threw the same error as above.