The Problem: I am unable to write any dataframe that contains a non-nullable column to Azure Synapse's dedicated SQL pool.
Problem details:
I have a DataFrame with the following schema:
StructType(List(
    StructField(key,StringType,false),
    StructField(Col_1,StringType,true),
    StructField(Col_2,IntegerType,true)))
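For reference, a DataFrame with this exact schema can be reproduced with an explicit StructType (the row values below are just placeholders, not my real data):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("key", StringType(), nullable=False),
    StructField("Col_1", StringType(), nullable=True),
    StructField("Col_2", IntegerType(), nullable=True),
])
df = spark.createDataFrame([("k1", "a", 1), ("k2", "b", 2)], schema)
df.printSchema()  # "key" shows nullable = false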
When I try to write this to my Azure Dedicated SQL pool with:
DataFrame.write.mode("overwrite").synapsesql("DB.SCHEMA.TEST_TABLE")
I receive the following error:
Py4JJavaError: Failed to validate staging directory path. Verify if there's external user action to cancel the staging spark job.
Band-aid Solution:
Using a code snippet from this Stack Overflow question:
def set_df_columns_nullable(spark, df, column_list, nullable=True):
    # Flip the nullability flag on the matching StructFields in place
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    # Rebuild the DataFrame from the underlying RDD using the modified schema
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod
I am able to mark the "key" column as nullable, after which the write call above happily writes the data to the table.
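Roughly, the workaround looks like this (using the df and column name from the schema above):

df_nullable = set_df_columns_nullable(spark, df, ["key"])
df_nullable.printSchema()  # "key" is now nullable = true
df_nullable.write.mode("overwrite").synapsesql("DB.SCHEMA.TEST_TABLE")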
The Question:
What is the explanation here? Why won't the Synapse table accept non-nullable data from Spark?