
My use case is as follows: a periodic data revision is published (weekly). Records from this revision either insert new rows into the DDB table or update existing ones (not all rows are affected every time, since the update has a conditional clause). Once the data is ingested, an integration test follows: an automated approval workflow with control/treatment, where control reads the older data and treatment reads the new revision. It touches random rows from the DDB table and validates that the new revision does not cause a regression (a metric threshold check, pre vs. post). Once the approval is complete, the data revision is considered valid and moved to production.

Thoughts so far: maintain versions of the records while inserting the data into the primary table for each revision. Each revision has a unique identifier, which becomes the "version" for all of its records. Maintain a separate table for storing all the active versions. A version is added to the active versions table only if the approval succeeds, along with an approval timestamp. To determine the active version for a record, we identify the revision with the latest approval timestamp (a rough read-path sketch follows the table layouts below).

primary table: primary_key, version, score

active versions table: version, approval_date
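
For concreteness, here is a rough sketch of the read path this design implies, using boto3. The table names (`primary_table`, `active_versions`) are placeholders, and I'm assuming `primary_key` is the partition key and `version` the sort key of the primary table:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
primary = dynamodb.Table("primary_table")             # placeholder name
active_versions = dynamodb.Table("active_versions")   # placeholder name

def get_active_record(primary_key: str):
    # Pull every stored version of this record; this is the part whose cost
    # grows with the number of revisions kept in the table.
    resp = primary.query(
        KeyConditionExpression=Key("primary_key").eq(primary_key)
    )
    candidates = {item["version"]: item for item in resp["Items"]}

    # The active versions table should stay small, so a Scan is tolerable
    # here (or the result could be cached).
    approved = active_versions.scan()["Items"]

    # Walk approved versions from newest approval to oldest and return the
    # first one that actually has data for this record.
    for entry in sorted(approved, key=lambda a: a["approval_date"], reverse=True):
        if entry["version"] in candidates:
            return candidates[entry["version"]]
    return None
```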

The problem with this approach is that reads become costly: as the number of versions grows, each read has to touch more records to find the active one. This can be mitigated with a cleanup job, but the cleanup job itself would be a scan operation (roughly the sketch below).
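
The cleanup job I have in mind would be roughly this kind of paginated Scan plus batch delete (again a sketch with placeholder names; it only drops versions that have been removed from the active set, and per-record pruning of superseded versions would need more bookkeeping):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
primary = dynamodb.Table("primary_table")
active_versions = dynamodb.Table("active_versions")

def cleanup_stale_versions(keep_versions: set) -> None:
    # Full-table Scan, paginated via LastEvaluatedKey; this is exactly the
    # expensive operation I would like to avoid.
    scan_kwargs = {
        "ProjectionExpression": "primary_key, #v",
        "ExpressionAttributeNames": {"#v": "version"},  # "version" may be reserved
    }
    while True:
        page = primary.scan(**scan_kwargs)
        with primary.batch_writer() as batch:
            for item in page["Items"]:
                if item["version"] not in keep_versions:
                    batch.delete_item(
                        Key={
                            "primary_key": item["primary_key"],
                            "version": item["version"],
                        }
                    )
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# Keep only versions currently listed in the active versions table.
active = {v["version"] for v in active_versions.scan()["Items"]}
cleanup_stale_versions(active)
```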

Looking for: ways to optimize this approach, or nuke it and phoenix (i.e., replace it with a better design altogether).

TIA!
