
I have a Delta table in Databricks whose data is stored in ADLS, partitioned by a date column. Parquet data is available in ADLS from 01-06-2022 onwards, but when I query the table in Databricks I only see the latest day's data; the older data is not displayed. Every day the data is written to the table path, overwriting it, partitioned by the date column.

David Browne - Microsoft
Vamsi Krishna

1 Answer

df.write.format('delta').mode('overwrite').save('{}/{}'.format(DELTALAKE_PATH, table))

Overwrite mode deletes the existing data at the path before writing the new data. This is the cause of your issue: each daily write removes all previously written partitions.

df.write.format('delta').mode('append').save('{}/{}'.format(DELTALAKE_PATH, table))

Append mode adds the new data alongside the existing data. Your existing data is kept, and queries will return past records as well.

You need to use append mode in place of overwrite mode.

Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
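The difference can be illustrated with a small simulation in plain Python (not Spark; the `write` helper and the partition names are hypothetical stand-ins for what the DataFrameWriter does to the table path):

```python
# Minimal sketch of how 'overwrite' vs 'append' affect the files under a
# date-partitioned table path. A dict stands in for the storage directory.

def write(table, new_partitions, mode):
    """Model a DataFrame write: 'overwrite' replaces the whole table path,
    'append' adds the new partition files next to the existing ones."""
    if mode == 'overwrite':
        table.clear()          # old partitions (past dates) are deleted
    table.update(new_partitions)
    return table

# Daily overwrite: only the most recent date survives.
daily = {}
write(daily, {'date=2022-06-01': ['part-0000.parquet']}, mode='overwrite')
write(daily, {'date=2022-06-02': ['part-0000.parquet']}, mode='overwrite')
print(sorted(daily))  # only 'date=2022-06-02' remains

# Daily append: every date stays queryable.
history = {}
write(history, {'date=2022-06-01': ['part-0000.parquet']}, mode='append')
write(history, {'date=2022-06-02': ['part-0000.parquet']}, mode='append')
print(sorted(history))  # both dates remain
```

This mirrors what you observe in Databricks: with daily overwrites, a query can only return the most recently written date.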

Reference - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts

Abhishek K
  • it's much simpler to use `f'{DELTALAKE_PATH}/{table}'` instead of `'{}/{}'.format(DELTALAKE_PATH, table)` – Alex Ott Aug 24 '22 at 10:17
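The two styles in the comment are equivalent; for example (the path and table name below are illustrative, not from the question):

```python
# Illustrative values; in practice these come from your workspace config.
DELTALAKE_PATH = '/mnt/datalake/delta'  # hypothetical mount point
table = 'sales'                          # hypothetical table name

old_style = '{}/{}'.format(DELTALAKE_PATH, table)
new_style = f'{DELTALAKE_PATH}/{table}'  # same result, easier to read
print(old_style == new_style)  # True
```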