My use case is that I want to partition my table by date: rows for new dates should be appended, but if the code is rerun for a date that already exists, that date's partition should be overwritten.
After looking online, it seems this can be done with Delta Lake's replaceWhere feature, but I am fine with any solution that uses plain parquet (see the sketch at the end of this post).
I have the following code:
from datetime import date
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, StringType, StructField, StructType
spark = SparkSession.builder.getOrCreate()
data = [(date(2022, 6, 19), "Hello"), (date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()), StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)
# Overwrite only the date=2022-06-19 partition at each target path.
df.write.partitionBy("date").option("replaceWhere", "date = '2022-06-19'").save("/tmp/test", mode="overwrite", format="delta")
df.write.partitionBy("date").option("replaceWhere", "date = '2022-06-19'").save("/tmp/test_3", mode="overwrite", format="delta")
The second write call throws the following exception:
pyspark.sql.utils.AnalysisException: Data written out does not match replaceWhere 'date = '2022-06-19''.
CHECK constraint EXPRESSION(('date = 2022-06-19)) (date = '2022-06-19') violated by row with values:
- date : 17337
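For reference, this is roughly what I mean by a parquet-only solution: a minimal sketch using Spark's dynamic partition overwrite mode, which I have not verified against my full pipeline (the /tmp/test_parquet path is just a placeholder). If replaceWhere is not the right tool here, I would be happy with this kind of approach instead.
from datetime import date
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, StringType, StructField, StructType
spark = SparkSession.builder.getOrCreate()
# Dynamic mode: an "overwrite" write only replaces the partitions present
# in the incoming DataFrame; other date partitions are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data = [(date(2022, 6, 19), "Hello"), (date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()), StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)
# Placeholder path; rerunning this only rewrites the date=2022-06-19 partition.
df.write.partitionBy("date").mode("overwrite").parquet("/tmp/test_parquet")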