
We have been using Azure Databricks / Delta Lake for the last couple of months and have recently started to spot some strange behaviour with loaded records: the latest records are not returned unless the cluster is restarted or a specific version number is specified.

For example, this returns no records:

df_nw = spark.read.format('delta').load('/mnt/xxxx')
display(df_nw.filter("testcolumn = ???"))

But this does:

%sql

SELECT * FROM delta.`/mnt/xxxx` VERSION AS OF 472 where testcolumn = ???

As mentioned above, this only seems to be affecting newly inserted records. Has anyone else come across this before?

Any help would be appreciated.

Thanks Col

2 Answers


Check to see if you've set a staleness limit. If you have, this behaviour is expected; if not, please create a support ticket.

https://docs.databricks.com/delta/optimizations/file-mgmt.html#manage-data-recency

Joe Widen
  • Thanks for this. I looked into the linked page, ran `spark.conf.get("spark.databricks.delta.stalenessLimit")`, and confirmed that our staleness limit is set to '0ms'. I will raise a support ticket and report back with their findings. – Colin Olliver Sep 07 '21 at 08:59

Just in case anyone else is having a similar problem, I thought it would be worth sharing the solution I accidentally stumbled across. Over the last week I was encountering issues with our Databricks cluster, whereby the Spark drivers kept crashing under resource-intensive workloads. After a lot of investigation, it turned out that our cluster was in Standard (Single User) mode, so I spun up a new High Concurrency cluster. The issue still occasionally appeared on the High Concurrency cluster, so I decided to flip the notebook back to the old cluster, which was still in an active state, and there the newly loaded data was available to be queried. This led me to believe that the Databricks / Spark engine was not refreshing the underlying data set and was instead using a previously cached version of the data, even though I hadn't explicitly cached the underlying data set.

After running %sql CLEAR CACHE, the data appeared as expected.
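For anyone who would rather do the same thing from a Python cell, here is a minimal sketch. The helper name `reload_delta` is made up for illustration, and it assumes `spark` is the session object that a Databricks notebook normally provides. It just issues the same CLEAR CACHE statement and then re-reads the path from the question; I can't promise it covers every caching layer on every cluster setup.

```python
def reload_delta(spark, path):
    """Clear the session's cache, then re-read the Delta table at `path`.

    `reload_delta` is a hypothetical helper, not part of any library.
    """
    # Same statement as running `%sql CLEAR CACHE` in a notebook cell.
    spark.sql("CLEAR CACHE")
    # Re-read the table so the new DataFrame is planned after the cache was cleared.
    return spark.read.format("delta").load(path)


# Usage in a notebook, where `spark` already exists:
# df_nw = reload_delta(spark, "/mnt/xxxx")
# display(df_nw.filter("testcolumn = ???"))
```

If the new records show up after this but not before, that points at stale cached data rather than a problem with the Delta table itself.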