
I am getting the error below while trying to write to a Synapse Dedicated SQL Pool. The DataFrame and the target table have the same schema, yet the connector apparently cannot match some of the DataFrame columns against the target columns.

Py4JJavaError: An error occurred while calling o2428.synapsesqlforpython.
: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: Schema from source does not match target table schema.
Match criteria: {Case-sensitive Column Names, DataType and Nullability}.
 Non matching fields from source:
[{Equipment_Cycle_Number, LongType, false},{Equipment_ID, LongType, false},{Equipment_Cycle_Activity_ID, LongType, false},{Equipment_Location_UTM_X, DoubleType, true},{Equipment_Location_UTM_Y, DoubleType, true},{Equipment_Location_UTM_Z, DoubleType, true}] 
 Fields on target: 
 StructType(StructField(Equipment_Cycle_Number,LongType,false), StructField(Activity_Start_Time,TimestampType,false), StructField(Activity_End_Time,TimestampType,false), StructField(Activity_Duration,DoubleType,true), StructField(Equipment_ID,LongType,false), StructField(Equipment_Operator_ID,LongType,true), StructField(Equipment_Cycle_Activity_ID,LongType,false), StructField(Equipment_Travel_Point_DateTime,TimestampType,true), StructField(Equipment_Location_UTM_X,DoubleType,true), StructField(Equipment_Location_UTM_Y,DoubleType,true), StructField(Equipment_Location_UTM_Z,DoubleType,true), StructField(Created_Date,TimestampType,true), StructField(Last_Modified_Date,TimestampType,true), StructField(Activity_Date,TimestampType,true))
    at com.microsoft.spark.sqlanalytics.SqlAnalyticsConnectorClass$SQLAnalyticsFormatWriter.sqlanalytics(SqlAnalyticsConnectorClass.scala:347)
    at com.microsoft.spark.sqlanalytics.SqlAnalyticsConnectorClass$SQLAnalyticsFormatWriter.synapsesql(SqlAnalyticsConnectorClass.scala:191)
    at com.microsoft.spark.sqlanalytics.SqlAnalyticsConnectorClass$SQLAnalyticsFormatWriter.synapsesqlforpython(SqlAnalyticsConnectorClass.scala:203)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: Schema from source does not match target table schema.

PySpark Synapse write code:

# Imports needed for the connector (per the connector's PySpark docs)
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

df_activity_staging.write\
    .option(Constants.SERVER, synapseURL)\
    .option(Constants.USER, synapseUser)\
    .option(Constants.PASSWORD, synapsePwd)\
    .option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<folder>")\
    .option(Constants.STAGING_STORAGE_ACCOUNT_KEY, synapse_storage_account_access_key)\
    .mode("overwrite")\
    .synapsesql("DB_Name.Schema.Table")

Table Schema

CREATE TABLE [schema].[Table]
(
    [Equipment_Cycle_Number] [bigint] NOT NULL,
    [Activity_Start_Time] [datetime] NOT NULL,
    [Activity_End_Time] [datetime] NOT NULL,
    [Activity_Duration] [float] NULL,
    [Equipment_ID] [bigint] NOT NULL,
    [Equipment_Operator_ID] [bigint] NULL,
    [Equipment_Cycle_Activity_ID] [bigint] NOT NULL,
    [Equipment_Travel_Point_DateTime] [datetime] NULL,
    [Equipment_Location_UTM_X] [float] NULL,
    [Equipment_Location_UTM_Y] [float] NULL,
    [Equipment_Location_UTM_Z] [float] NULL,
    [Created_Date] [datetime] NULL,
    [Last_Modified_Date] [datetime] NULL,
    [Activity_Date] [date] NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
GO

I tried the JDBC connector instead; it works, but writing is very slow, taking more than an hour for 100k records:

# Note: .jdbc() performs the write itself and returns None, so do not chain .save() after it
df_4Write.write.jdbc(url=synDSPUrl, table="Schema.Table", mode="overwrite", properties=synDSPConnectionProperties)
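For reference, a sketch of the same write with a larger JDBC batch size (batchsize is a standard Spark JDBC write option, default 1000; whether it closes the gap here is untested, and it still will not match the connector's staged-copy path):

# Sketch: raise the JDBC batch size to reduce per-batch round trips;
# assumes synDSPConnectionProperties is a plain dict of connection properties
df_4Write.write.jdbc(
    url=synDSPUrl,
    table="Schema.Table",
    mode="overwrite",
    properties={**synDSPConnectionProperties, "batchsize": "10000"},
)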

The Apache Spark connector with a staging folder is supposed to be much faster, but it seems broken.

  • Can you try to read a dataframe from the target Synapse table and print its schema? Just for debugging. Something like `spark.read.synapsesql("DB_Name.Schema.Table").limit(0).printSchema()` I guess. Do the same for the source as well, while you're at it. So you have two dataframes, one each from src and target, and `printSchema()` on both. – Kashyap Mar 01 '23 at 23:45

1 Answer


I tried to reproduce this and got the same Schema from source does not match target table schema error.


While investigating, I found that the schemas of the DataFrame and the Synapse table did not match exactly.

As Kashyap said, you can get the schema of the target table with

spark.read.synapsesql("DB_Name.Schema.Table").limit(0).printSchema()

and the DataFrame schema with df.printSchema(), then compare them to see whether any column name, type, or nullability differs. In my case one column was different, which is why I was getting the error.
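If the table is wide, comparing the two printed schemas by eye is error-prone. A minimal sketch of a field-by-field comparison (df_source stands for your source DataFrame; that variable name is illustrative, not from the original post):

# Sketch: report per-column differences in name, type, or nullability,
# matching the connector's stated criteria (case-sensitive names included)
target_schema = spark.read.synapsesql("DB_Name.Schema.Table").limit(0).schema
source_fields = {f.name: f for f in df_source.schema.fields}

for t in target_schema.fields:
    s = source_fields.get(t.name)
    if s is None:
        print(f"missing in source: {t.name}")
    elif s.dataType != t.dataType or s.nullable != t.nullable:
        print(f"mismatch on {t.name}: source {s.dataType}/{s.nullable} vs target {t.dataType}/{t.nullable}")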

enter image description here

To resolve this, correct the schema of the DataFrame by creating a user-defined schema and applying it while reading the data.

# To create a user-defined schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("Id", IntegerType(), True), StructField("Name", StringType(), True)])

To use the schema when building the DataFrame, pass the StructType object (not a string) to spark.read.schema(schema), as shown in this document.
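For example, a minimal usage sketch (the file path and CSV format below are placeholders of mine, not from the original setup):

# Sketch: apply the user-defined schema at read time instead of letting
# Spark infer the types
df = spark.read.schema(schema).csv("abfss://<container>@<account>.dfs.core.windows.net/<path>/data.csv", header=True)
df.printSchema()  # Id as integer, Name as string, per the schema above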

Execution and Output:

After correcting the schema of the DataFrame, I am able to write the data to the Synapse table.


Pratik Lad
  • The issue was actually a datatype mismatch between the dataframe and the SQL table; I had to convert each column of the dataframe to the SQL datatype while writing to Synapse. df_activity_staging\ .withColumn("Equipment_Cycle_Number", col("Equipment_Cycle_Number").cast("Long"))\ .withColumn("Activity_Start_Time", col("Activity_Start_Time").cast("Timestamp"))\ .withColumn("Activity_End_Time", col("Activity_End_Time").cast("Timestamp"))\ – DevFahim Apr 05 '23 at 22:25
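For completeness, a sketch of the full cast chain that comment implies, with the remaining columns filled in from the "Fields on target" list in the error message (my reconstruction, not code posted by DevFahim). Note the connector also compares nullability, which casting alone does not change; rebuilding the DataFrame against the target schema is one way to force it:

# Sketch: cast every column to the type reported under "Fields on target"
# in the error message above
from pyspark.sql.functions import col

df_casted = df_activity_staging \
    .withColumn("Equipment_Cycle_Number", col("Equipment_Cycle_Number").cast("long")) \
    .withColumn("Activity_Start_Time", col("Activity_Start_Time").cast("timestamp")) \
    .withColumn("Activity_End_Time", col("Activity_End_Time").cast("timestamp")) \
    .withColumn("Activity_Duration", col("Activity_Duration").cast("double")) \
    .withColumn("Equipment_ID", col("Equipment_ID").cast("long")) \
    .withColumn("Equipment_Operator_ID", col("Equipment_Operator_ID").cast("long")) \
    .withColumn("Equipment_Cycle_Activity_ID", col("Equipment_Cycle_Activity_ID").cast("long")) \
    .withColumn("Equipment_Travel_Point_DateTime", col("Equipment_Travel_Point_DateTime").cast("timestamp")) \
    .withColumn("Equipment_Location_UTM_X", col("Equipment_Location_UTM_X").cast("double")) \
    .withColumn("Equipment_Location_UTM_Y", col("Equipment_Location_UTM_Y").cast("double")) \
    .withColumn("Equipment_Location_UTM_Z", col("Equipment_Location_UTM_Z").cast("double")) \
    .withColumn("Created_Date", col("Created_Date").cast("timestamp")) \
    .withColumn("Last_Modified_Date", col("Last_Modified_Date").cast("timestamp")) \
    .withColumn("Activity_Date", col("Activity_Date").cast("timestamp"))  # target reports TimestampType despite the DDL's [date]

# If nullability still differs, rebuild against the exact target schema
# (this raises at runtime if a non-nullable column actually contains nulls):
target_schema = spark.read.synapsesql("DB_Name.Schema.Table").limit(0).schema
df_fixed = spark.createDataFrame(df_casted.rdd, target_schema)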