
I'm trying to use the MS SQL connector for Spark to insert high volumes of data from pyspark.

After creating a session:

        spark = (
            SparkSession.builder
                .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.2.0,org.apache.spark:spark-avro_2.12:3.1.2,com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0')
                .getOrCreate()
        )

I get the following error:

ERROR executor.Executor: Exception in task 6.0 in stage 12.0 (TID 233)
java.lang.NoSuchMethodError: 'void com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.writeToServer(com.microsoft.sqlserver.jdbc.ISQLServerBulkData)'
        at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.bulkWrite(BulkCopyUtils.scala:110)
        at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.savePartition(BulkCopyUtils.scala:58)
        at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2(BestEffortSingleInstanceStrategy.scala:43)
        at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2$adapted(BestEffortSingleInstanceStrategy.scala:42)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

When trying to write data like this:

        try:
            (
                df.write.format("com.microsoft.sqlserver.jdbc.spark")
                    .mode("append")
                    .option("url", url)
                    .option("dbtable", table_name)
                    .option("user", username)
                    .option("password", password)
                    .option("schemaCheckEnabled", "false")
                    .save()
            )
        except ValueError as error:
            print("Connector write failed", error)

I tried different versions of Spark and of the SQL connector, but no luck so far. I also tried pointing directly at a jar for the mssql-jdbc dependency:

SparkSession.builder
   .config('spark.jars', '/mssql-jdbc-8.4.1.jre8.jar')
   .config(...)

It still complains that it can't find the method; however, if you inspect the JAR file, the method is defined there.
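
For what it's worth, a quick way to check which jar the driver JVM is actually resolving the class from (a rough sketch, assuming the session object is named spark; the executors may still pick it up from a different jar):

        # Ask the driver JVM where SQLServerBulkCopy was actually loaded from.
        clazz = spark._jvm.java.lang.Class.forName(
            "com.microsoft.sqlserver.jdbc.SQLServerBulkCopy")
        print(clazz.getProtectionDomain().getCodeSource().getLocation())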

Any tips on where to look are welcome!

warreee

2 Answers


We reproduced the same scenario in our environment and it is working correctly now.

There is an issue with JDBC driver 8.2.2; you can use an older version of the library.

Below is the code sample,

[screenshot of the code sample, not reproduced]
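
Since the screenshot is not reproduced here, a rough text equivalent of the approach (connection details are placeholders; note it uses the plain "jdbc" format rather than the MS SQL Spark connector) would be:

        # Sketch of the screenshot's approach: a plain JDBC write with the SQL Server driver.
        jdbc_url = "jdbc:sqlserver://<server>:1433;databaseName=<database>"

        df.write \
            .format("jdbc") \
            .mode("append") \
            .option("url", jdbc_url) \
            .option("dbtable", "dbo.<table>") \
            .option("user", "<username>") \
            .option("password", "<password>") \
            .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
            .save()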

Output:

[screenshot of the query output, not reproduced]

Data got inserted into the table from PySpark.

Reference: NoSuchMethodError for BulkCopy.

RajkumarPalnati
    Please see [Please do not upload images of code/data/errors when asking a question.](//meta.stackoverflow.com/q/285551) - images of text are unusable for many reasons. – David Makogon Jun 25 '22 at 15:38
  • I tried that but it didn't resolve anything. FYI, you are not using the MS SQL spark connector, you are using standard JDBC connector, if you want to change that, you have to change the format to "com.microsoft.sqlserver.jdbc.spark". – warreee Jun 27 '22 at 15:14

Using com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8 is one thing, but you also need a version of Microsoft's Spark SQL connector that is compatible with your Spark version.

com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha together with com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8 did not work in my case, as I'm using AWS Glue 3.0 (which is Spark 3.1).

I had to switch to com.microsoft.azure:spark-mssql-connector_2.12:1.2.0, as it is compatible with Spark 3.1.

def write_df_to_target(self, df, schema_table):
    spark = self.gc.spark_session
    # Make sure the connector and the matching JDBC driver are on the classpath.
    spark.builder.config('spark.jars.packages', 'com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0').getOrCreate()
    credentials = self.get_credentials(self.replica_connection_name)
    # Write through the MS SQL Spark connector rather than the plain JDBC source.
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", credentials["url"] + ";databaseName=" + self.database_name) \
        .option("dbtable", schema_table) \
        .option("user", credentials["user"]) \
        .option("password", credentials["password"]) \
        .option("batchsize", "100000") \
        .option("numPartitions", "15") \
        .save()

Last thing: the AWS Glue job must have the --user-jars-first: "true" parameter. It tells Glue to load the jars you provide first, so they override the bundled defaults.

Check whether an equivalent parameter exists on your end; a sketch of how it can be set follows.
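
For example, when the Glue job is defined programmatically, the flag can go into the job's default arguments, e.g. with boto3 (a sketch only; the job name, role and S3 paths are placeholders):

        import boto3

        glue = boto3.client("glue")
        glue.update_job(
            JobName="my-mssql-load-job",   # placeholder job name
            JobUpdate={
                "Role": "my-glue-job-role",  # placeholder IAM role
                "Command": {"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
                "DefaultArguments": {
                    # Put user-supplied jars ahead of Glue's bundled ones.
                    "--user-jars-first": "true",
                    # Connector and JDBC driver jars uploaded to S3.
                    "--extra-jars": "s3://my-bucket/jars/spark-mssql-connector_2.12-1.2.0.jar,"
                                    "s3://my-bucket/jars/mssql-jdbc-8.4.1.jre8.jar",
                },
            },
        )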

marcin2x4