I'm working with Apache Spark 1.4.0 on Windows 7 x64, with Java 1.8.0_45 x64 and Python 2.7.10 x86, in IPython 3.2.0.

I am attempting to write a DataFrame-based program in an IPython notebook that reads from and writes back to an SQL Server database.

So far I can read data from the database:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(
    source="jdbc",
    url="jdbc:sqlserver://serverURL",
    dbtable="dbName.tableName",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    user="userName",
    password="password")

and convert the data to a pandas DataFrame and do whatever I want with it. (This was more than a little hassle, but it works after adding Microsoft's sqljdbc42.jar to spark.driver.extraClassPath in spark-defaults.conf.)

The current problem arises when I go to write the data back to SQL Server with the DataFrameWriter API:

df.write.jdbc("jdbc:sqlserver://serverURL", "dbName.SparkTestTable1", dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver", user="userName", password="password"))

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-19-8502a3e85b1e> in <module>()
----> 1 df.write.jdbc("jdbc:sqlserver://jdbc:sqlserver", "dbName.SparkTestTable1", dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver", user="userName", password="password"))

C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\readwriter.pyc in jdbc(self, url, table, mode, properties)
    394         for k in properties:
    395             jprop.setProperty(k, properties[k])
--> 396         self._jwrite.mode(mode).jdbc(url, table, jprop)
    397 
    398 

C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    302                 raise Py4JError(
    303                     'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
--> 304                     format(target_id, '.', name, value))
    305         else:
    306             raise Py4JError(

Py4JError: An error occurred while calling o49.mode. Trace:
py4j.Py4JException: Method mode([class java.util.HashMap]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Unknown Source)

The problem seems to be that py4j cannot find the Java java.util.HashMap class when it goes to convert my connectionProperties dictionary into a JVM object. Adding my rt.jar (with path) to spark.driver.extraClassPath does not resolve the issue. Removing the dictionary from the write command avoids this error, but of course the write then fails due to the lack of a driver and authentication.

Edit: The o49.mode part of the error changes from run to run.
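(For what it's worth, the error is consistent with the dictionary being bound to the wrong parameter rather than with HashMap being missing. A minimal pure-Python sketch of what appears to be happening, using a hypothetical stub that only mirrors the Spark 1.4 `DataFrameWriter.jdbc(url, table, mode, properties)` signature, not the real pyspark code:)

```python
# Hypothetical stub mirroring the Spark 1.4 DataFrameWriter.jdbc
# signature (url, table, mode="error", properties=None). Illustrative
# only; it just reports how the arguments get bound.
def jdbc(url, table, mode="error", properties=None):
    return mode, properties or {}

# Passing the connection dict as the 3rd positional argument binds it
# to `mode` -- which is why Py4J ends up calling .mode(HashMap) on the
# JVM side and fails with "Method mode([class java.util.HashMap]) does
# not exist".
mode, props = jdbc("jdbc:sqlserver://serverURL", "dbName.SparkTestTable1",
                   {"user": "userName", "password": "password"})
# `mode` is now the dict and `props` is empty.
```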

  • Looking at this question it seems Py4J does not always give accurate errors :) http://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3CCAF0gKfOhpK7zC5d5Y6yLY4cBVUD=pwzrauXzyxQ5C6m7pKwtXA@mail.gmail.com%3E Does dbName.SparkTestTable1 already exist? PySpark by default will raise an error if it exists. https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes Can you try changing the mode: ```mode="overwrite"``` – jay Jul 14 '15 at 22:57
  • @jay The table does not exist; the intent of the program is to read data out of one table and write it to a new table, just to prove that I can do SQL Server DataFrame IO before I move on to more complex programs. Changing the start of the line to df.write.mode("overwrite").jdbc("jdbc:sqlserver:// does not change the error. – Techrocket9 Jul 14 '15 at 23:27

1 Answer

Davies Liu on the Spark users mailing list found the problem. There is a subtle difference between the Scala and Python APIs that I missed. You have to pass a mode string (such as "overwrite") as the third parameter in the Python API, but not in the Scala API. Changing the statement as follows resolves the issue:

df.write.jdbc(
    "jdbc:sqlserver://serverURL",
    "dbName.SparkTestTable1",
    "overwrite",
    dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
         user="userName",
         password="password"))