The Problem: I am unable to write any dataframe that contains a non-nullable column to Azure Synapse's dedicated SQL pool.
Problem details:
I have a DataFrame with the following schema:
StructType(List(
    StructField(key,StringType,false),
    StructField(Col_1,StringType,true),
    StructField(Col_2,IntegerType,true)))
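For reference, a DataFrame with this exact schema can be reproduced with an explicit StructType (the row values below are just placeholders, not my real data):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("key", StringType(), nullable=False),
    StructField("Col_1", StringType(), nullable=True),
    StructField("Col_2", IntegerType(), nullable=True),
])
df = spark.createDataFrame([("k1", "a", 1), ("k2", "b", 2)], schema)
df.printSchema()  # "key" shows nullable = false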
When I try to write this to my Azure Dedicated SQL pool with:
DataFrame.write.mode("overwrite").synapsesql("DB.SCHEMA.TEST_TABLE")
I receive the following error:
Py4JJavaError: Failed to validate staging directory path. Verify if there's external user action to cancel the staging spark job.
Band-aid Solution:
Using a code snippet from this Stack Overflow question:
def set_df_columns_nullable(spark, df, column_list, nullable=True):
    # Flip the nullability flag on the matching StructFields in place
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    # Rebuild the DataFrame from the underlying RDD using the modified schema
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod
I am able to mark the "key" column as nullable, after which the write call above happily writes the data to the table.
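Roughly, the workaround looks like this (using the df and column name from the schema above):

df_nullable = set_df_columns_nullable(spark, df, ["key"])
df_nullable.printSchema()  # "key" is now nullable = true
df_nullable.write.mode("overwrite").synapsesql("DB.SCHEMA.TEST_TABLE")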
The Question:
What is the explanation here? Why won't the Synapse table accept non-nullable data from Spark?