
I have a PySpark DataFrame in Azure Databricks that I want to write into Azure Synapse, but I am getting the error below.

com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'ETL_TableName' has a data type that cannot participate in a columnstore index. 

I checked the connection to Synapse and everything works fine: I am able to read data, but writing fails with the error above. Could anyone please help me handle this error?

Code for writing data into Synapse:

dataFrame.repartition(1).write.format("jdbc")\
         .option("url", azureurl)\
         .option("tempDir", tempDir) \
         .option("forwardSparkAzureStorageCredentials", "true") \
         .option("dbTable", dbTable)\
         .option("append", "true")\
         .save()
HimanshuSPaul

2 Answers


A couple of things need to be changed.

Change the format from .format("jdbc") to .format("com.databricks.spark.sqldw").

Add the "tableOptions" option to your write statement. It takes the place of the WITH() clause of the CREATE TABLE (AS) statement:

.option ("tableOptions","heap,distribution=MY_DISTRIBUTION")

The code should look like this:

dataFrame.repartition(1).write.format("com.databricks.spark.sqldw")\
         .option("tableOptions", "heap,distribution=HASH(rownum)")\
         .option("url", azureurl)\
         .option("tempDir", tempDir)\
         .option("forwardSparkAzureStorageCredentials", "true")\
         .option("dbTable", dbTable)\
         .mode("append")\
         .save()

Reference:

Azure Databricks - Azure Synapse Analytics

Choose a value for MY_DISTRIBUTION based on the following guidance:

Guidance for designing distributed tables in Synapse SQL pool
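The tableOptions string above can be assembled with a small helper. This is a hypothetical sketch (build_table_options is not part of the connector) showing how the index and distribution choices from the guidance combine into the string the writer expects:

```python
def build_table_options(index="heap", distribution="ROUND_ROBIN", hash_column=None):
    """Assemble a tableOptions string for the com.databricks.spark.sqldw writer.

    index:        "heap" for staging or small tables; "clustered columnstore index"
                  for large fact tables (the Synapse default).
    distribution: "ROUND_ROBIN", "REPLICATE", or "HASH" (HASH requires hash_column).
    """
    if distribution.upper() == "HASH":
        if not hash_column:
            raise ValueError("HASH distribution requires a hash_column")
        dist = f"distribution=HASH({hash_column})"
    else:
        dist = f"distribution={distribution.upper()}"
    return f"{index},{dist}"
```

Used as, for example, .option("tableOptions", build_table_options(distribution="HASH", hash_column="rownum")), which produces "heap,distribution=HASH(rownum)" as in the answer above.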

CHEEKATLAPRADEEP
  • Hi Pradeep, thanks for replying, but unfortunately neither suggestion worked for me. For the first one, when we changed the format from jdbc to com.databricks.spark.sqldw as above, it was not even connecting to the DB, so we had to revert. For the second, how do I choose a value for MY_DISTRIBUTION in my case? – HimanshuSPaul Jul 06 '20 at 13:45
  • This is not the exact answer to the above issue, but it may help someone: "We still have no clue about the above issue, but I noticed it occurs when we try to write into an Azure Synapse warehouse." – HimanshuSPaul Jul 06 '20 at 13:57
  • As I don't have any strict reason for sticking with the Synapse warehouse, and the whole priority was to write data from Databricks to Azure in a structured format, I replaced the Azure Synapse warehouse with an Azure SQL Server database – HimanshuSPaul Jul 06 '20 at 13:57
  • And it's working much better. I will update the answer once I find the actual reason behind the issue. – HimanshuSPaul Jul 06 '20 at 13:58
  • Glad to know you were able to resolve the issue. Please do share the solution, which might be beneficial to other community members reading this thread. – CHEEKATLAPRADEEP Jul 07 '20 at 04:36
  • Hi @Paul Could you please update the answer and accept it as answer? – CHEEKATLAPRADEEP Jul 14 '20 at 07:34

This is not the exact answer to the above issue, but it may help someone get past it. I still have no clue about the reason behind the issue, but I noticed it occurs when trying to write into an Azure Synapse warehouse. As I don't have any strict reason for sticking with the Synapse warehouse, and the whole priority was to write data from Databricks to Azure in a structured format, I replaced the Azure Synapse warehouse with an Azure SQL Server database, and it's working much better. I will update the answer once I find the actual reason behind the issue.
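The workaround described above, writing to Azure SQL Database over plain JDBC, might be sketched as follows. This is a minimal sketch under assumptions: the server, database, table, and credential names are placeholders, and the actual write call (which needs a running Spark session and a reachable database) is shown in comments:

```python
def azure_sql_jdbc_url(server, database):
    """Build a JDBC URL for Azure SQL Database (hypothetical helper;
    the server and database names are supplied by the caller)."""
    return (f"jdbc:sqlserver://{server}.database.windows.net:1433;"
            f"database={database};encrypt=true;loginTimeout=30;")

# Placeholder names -- replace with your own server and database.
url = azure_sql_jdbc_url("myserver", "mydb")

# The write itself; note .mode("append") rather than .option("append", "true"):
# dataFrame.write.format("jdbc") \
#     .option("url", url) \
#     .option("dbtable", "dbo.ETL_Output") \
#     .option("user", sql_user) \
#     .option("password", sql_password) \
#     .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
#     .mode("append") \
#     .save()
```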

HimanshuSPaul