
In this command (taken from), would replaceWhere cause deletion of records? For example, the date range mentioned in the command had 1000 rows, and the new df has only 100. Would this cause a deletion of 900 records?

df.write \
  .format("delta") \
  .mode("overwrite") \
  .option("replaceWhere", "date >= '2017-01-01' AND date <= '2017-01-31'") \
  .save("/mnt/delta/events")
Kafels
Blue Clouds
  • Same question answered here: https://stackoverflow.com/questions/59851167/spark-delta-overwrite-a-specific-partition/65305467#65305467 – Ali Hasan May 15 '21 at 09:13

1 Answer


This option works much like a dynamic partition overwrite: you are telling Spark to overwrite only the data that falls within the given range. In addition, the data is saved only if every row of your dataframe matches the replaceWhere condition. If even a single row does not match, the exception "Data written out does not match replaceWhere" is thrown.

Q: Would this cause a deletion of 900 records?
A: Yes, it would delete.
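
To make the behaviour described above concrete, here is a minimal plain-Python sketch of the semantics. It is only a model, not Delta's implementation: the `replace_where` and `in_january` helpers, the `id` field, and the 50 extra February rows are all hypothetical, introduced purely for illustration.

```python
from datetime import date

def replace_where(table, new_rows, predicate):
    """Plain-Python model of an overwrite with replaceWhere (hypothetical helper, not Delta's code)."""
    # As described above, the write fails if any incoming row falls outside the condition.
    bad = [row for row in new_rows if not predicate(row)]
    if bad:
        raise ValueError(f"{len(bad)} row(s) written out do not match replaceWhere")
    # Existing rows that match the condition are dropped; everything else is kept.
    kept = [row for row in table if not predicate(row)]
    return kept + list(new_rows)

def in_january(row):
    # Models: date >= '2017-01-01' AND date <= '2017-01-31'
    return date(2017, 1, 1) <= row["date"] <= date(2017, 1, 31)

# Existing table: 1000 rows in January 2017 plus 50 rows in February (assumed data).
table = [{"id": i, "date": date(2017, 1, 1 + i % 31)} for i in range(1000)]
table += [{"id": 1000 + i, "date": date(2017, 2, 1 + i % 28)} for i in range(50)]

# New dataframe: only 100 rows, all inside the January range.
new_rows = [{"id": 5000 + i, "date": date(2017, 1, 1 + i % 31)} for i in range(100)]

result = replace_where(table, new_rows, in_january)
january_after = sum(1 for row in result if in_january(row))
print(len(table), "->", len(result), "rows; January rows now:", january_after)
```

In this model the 1000 January rows are replaced by the 100 new ones, so the January range ends up 900 rows smaller, while the February rows are untouched.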

I ran a test, creating a dataframe with two columns:

root
 |-- number: long (nullable = true)
 |-- even: integer (nullable = true)

The first run will save 1000 rows, where 500 are even and 500 are odd:

from pyspark.sql import Row
import pyspark.sql.functions as f

rows = [Row(number=i) for i in range(0, 1000)]

df = spark.createDataFrame(rows)
df = df.withColumn('even', (f.col('number') % 2 == f.lit(0)).cast('int'))

(df
 .write
 .partitionBy('even')
 .format('delta')
 .saveAsTable('my_delta_table'))

first dataframe output

The second run keeps only the even rows and overwrites the partition where even=1:

rows = [Row(number=i) for i in range(0, 10)]

df_only_even = spark.createDataFrame(rows)
df_only_even = df_only_even.withColumn('even', (f.col('number') % 2 == f.lit(0)).cast('int'))

# The dataframe must be filtered to match replaceWhere, or the write operation will throw an error
df_only_even = df_only_even.where(f.col('even') == f.lit(1))

(df_only_even
 .write
 .partitionBy('even')
 .format('delta')
 .option('replaceWhere', 'even == 1')
 .mode('overwrite')
 .saveAsTable('my_delta_table'))

second dataframe output

Result

My table named my_delta_table has 505 rows, where 500 are odd and 5 are even:

final table result
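
As a sanity check on those numbers, the same arithmetic can be reproduced without Spark. This is just a pure-Python sketch of the expected counts; the `with_even_flag` helper is hypothetical and only mirrors how the `even` column is derived in the test.

```python
# Pure-Python cross-check of the row counts (no Spark involved).
def with_even_flag(numbers):
    # Mirrors the 'even' column: 1 when the number is even, 0 otherwise.
    return [(n, int(n % 2 == 0)) for n in numbers]

first_run = with_even_flag(range(0, 1000))
second_run = [row for row in with_even_flag(range(0, 10)) if row[1] == 1]

# replaceWhere 'even == 1': drop the existing even rows, keep the odd rows,
# then add the new even rows.
final_table = [row for row in first_run if row[1] != 1] + second_run

odd_count = sum(1 for _, even in final_table if even == 0)
even_count = sum(1 for _, even in final_table if even == 1)
print(len(final_table), odd_count, even_count)  # prints: 505 500 5
```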

Kafels
  • Here's my link where I took the test: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2012413774589224/2923545371365447/476673152156399/latest.html – Kafels May 15 '21 at 03:20
  • "Would this cause a deletion of 900 records?" What you think the answer will be? – Blue Clouds May 15 '21 at 09:09
  • @BlueClouds I did everything and forgot to answer directly. Yes, it would delete. – Kafels May 15 '21 at 13:06