I have a Databricks environment and I need to create a real-time log table that records every change to any Delta table in my Hive metastore: CREATE, ALTER, INSERT, DELETE, any change to the table.

I need this to serve as a trigger for updating/refreshing downstream reporting, so that the data stays connected and consistent from start to finish.

Is this data captured centrally in Databricks? How would you go about setting this up?

I tried to find clear answers, but there are too many options that don't work.

Haze

1 Answer


This sounds like a good use case for Delta Lake's change data feed (CDF). CDF tracks row-level changes between versions of a Delta table, recording whether each row was inserted, updated, or deleted, along with the commit version and timestamp.

You can enable CDF for a Delta table in either of two ways (a sketch of reading the feed follows the list):

  • Existing table: Set the table property delta.enableChangeDataFeed = true in the ALTER TABLE command:

    ALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

  • All new tables: set the following Spark configuration for the session or cluster:

    set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;
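Once CDF is enabled, you can read the row-level changes as an ordinary batch query. Here's a minimal sketch, assuming a Databricks notebook where `spark` is the ambient SparkSession and reusing the myDeltaTable name from above (the starting version is just an illustration):

    # Batch-read the change feed of a CDF-enabled table.
    # `spark` is the ambient SparkSession in a Databricks notebook.
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")   # ask Delta for the change feed
        .option("startingVersion", 1)       # illustration: changes since version 1
        .table("myDeltaTable")
    )

    # Each change row carries _change_type ('insert', 'update_preimage',
    # 'update_postimage', 'delete'), _commit_version, and _commit_timestamp.
    changes.select("_change_type", "_commit_version", "_commit_timestamp").show()

The same feed is also available in SQL through the table_changes function, e.g. SELECT * FROM table_changes('myDeltaTable', 1).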

This should cover everything except table creation. To catch new tables, you can monitor the storage location where your Delta tables live for new directories, or, if you know where your tables are registered, periodically list the metastore and read the transaction logs of any tables you haven't seen before (sketches below).
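For the trigger/refresh part of your question, a common pattern is to read the change feed as a stream and run the downstream refresh in foreachBatch. A minimal sketch; the table names, checkpoint path, and refresh logic are placeholders, and you'd run one such stream per watched table since CDF is per-table rather than metastore-wide:

    # Stream the change feed and refresh downstream reporting per micro-batch.

    def refresh_downstream(microbatch_df, batch_id):
        # Placeholder refresh: append the changes to a central log table.
        # In practice this could be a MERGE into a reporting table or a
        # call that kicks off a downstream job.
        microbatch_df.write.mode("append").saveAsTable("reporting_change_log")

    (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")
        .table("myDeltaTable")
        .writeStream
        .option("checkpointLocation", "/tmp/checkpoints/myDeltaTable_cdf")  # placeholder
        .foreachBatch(refresh_downstream)
        .start()
    )

For the table-creation gap, one option is to poll the catalog and diff against the set of tables you've already seen. A rough sketch; "default" is a placeholder database name:

    # Poll the metastore for tables that appeared since the last check.
    known = set()

    def find_new_tables(database="default"):
        current = {t.name for t in spark.catalog.listTables(database)}
        new = current - known
        known.update(current)
        return new  # names of newly created tables, if any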

Jim Hibbard