I am using Azure Databricks with the latest runtime for my clusters. I have some confusion regarding the VACUUM operation in Delta Lake. We know we can set a retention duration on deleted data; however, for the data to actually be deleted after the retention period is over, do we need to keep the cluster up for the entire duration?

In simple words: do we need to have a cluster always in a running state in order to leverage Delta Lake?

2 Answers

You don't need to keep a cluster up and running all the time. You can schedule a VACUUM job to run daily (or weekly) to clean up stale data older than the retention threshold. Delta Lake doesn't require an always-on cluster: all the data and metadata live in storage (S3/ADLS/ABFS/HDFS), so nothing needs to stay up and running.
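
For example, here is a minimal sketch of such a scheduled job in PySpark; the table path /mnt/datalake/events is a made-up example, and `spark` is the session the Databricks runtime provides:

    # Minimal sketch of a scheduled VACUUM job (e.g. run daily as a Databricks job).
    from delta.tables import DeltaTable

    delta_table = DeltaTable.forPath(spark, "/mnt/datalake/events")

    # Remove data files no longer referenced by the table and older than
    # 168 hours (7 days, the default retention threshold).
    delta_table.vacuum(168)

The cluster only needs to exist while this job runs; a scheduled Databricks job can spin up a job cluster, run the VACUUM, and terminate.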

– CHEEKATLAPRADEEP
– Joe Widen

Apparently you do need a cluster up and running at all times in order to query the data available in Databricks tables.

If you have configured an external metastore for Databricks, then you can point any Hive-compatible wrapper (such as Apache Hive itself) at that external metastore database and query the data through the Hive layer without using Databricks.
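
As a hedged sketch of that setup, here is what it might look like using a standalone Spark session with Hive support as the wrapper (Spark rather than Hive itself); the metastore host, database name, driver, and credentials are all placeholders, not values from this answer:

    from pyspark.sql import SparkSession

    # Hypothetical external Hive metastore settings; the JDBC URL, driver,
    # and credentials are placeholders to be replaced with real values.
    spark = (
        SparkSession.builder
        .appName("external-metastore-query")
        .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                "jdbc:mysql://<metastore-host>:3306/<metastore-db>")
        .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
                "org.mariadb.jdbc.Driver")
        .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "<user>")
        .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "<password>")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Tables registered in the shared metastore are now visible to this
    # non-Databricks engine.
    spark.sql("SHOW TABLES").show()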

– Shane
  • this also sounds like a good way to query data out of the data lake. However, what about performance: is it as good as Databricks, or at least comparable? – Anish Sarangi Jan 06 '21 at 12:56