
I want to build an Incremental Load Pattern for a Source System that has no audit fields stating when a record was last modified (for example, a Last Modified On datetime column).

These tables do have Primary Keys and Unique Keys, which the application uses to update a record whenever any attribute changes.

The question is: how can I determine the deltas every day and load them into Azure Data Lake using Azure Data Factory / Databricks?

Should I stage the full set of data from the current day and from current day -1, and determine the deltas by comparing hash values?

Or is there a better way?
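
For reference, a minimal Databricks/PySpark sketch of the hash-comparison idea I have in mind is below; the key column `id`, the lake paths, and the date folder are placeholders, not objects from the actual system.

```python
# Minimal PySpark sketch of hash-based delta detection between two full
# snapshots. Key column, paths, and date folder are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

today = spark.read.parquet("/mnt/datalake/staging/current_day")
yesterday = spark.read.parquet("/mnt/datalake/staging/previous_day")

def with_row_hash(df, key_cols):
    # Hash every non-key column so a change in any attribute changes the hash.
    non_key = [c for c in df.columns if c not in key_cols]
    return df.withColumn(
        "row_hash",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in non_key]), 256),
    )

key_cols = ["id"]
today_h = with_row_hash(today, key_cols)
prev_h = with_row_hash(yesterday, key_cols).select(*key_cols, F.col("row_hash").alias("prev_hash"))

# Deltas = new keys plus keys whose row hash changed since yesterday.
delta = (
    today_h.join(prev_h, on=key_cols, how="left")
    .where(F.col("prev_hash").isNull() | (F.col("row_hash") != F.col("prev_hash")))
    .drop("row_hash", "prev_hash")
)

delta.write.mode("overwrite").parquet("/mnt/datalake/incremental/2020-03-25")
```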

Sreedhar
  • How about engaging change data capture (CDC) on the source system? Some vendors have it built in, e.g. SQL Server CDC, or you can use an agent like Attunity Replicate, GoldenGate, custom triggers, etc. – wBob Mar 25 '20 at 22:48
  • We don't have control over this Source System; we only have read access to one of the schemas. – Sreedhar Mar 25 '20 at 23:30

1 Answer


As this database is not huge, I ended up creating a pipeline that loads the full dataset into a SQL staging schema, writes it back to the Data Lake into the relevant location as the Initial Load dataset, and then promotes the SQL staging tables to a PreviousDay schema.

From then on, each incremental run reads the full dataset into SQL staging and compares it with the PreviousDay dataset, extracts the changed records, and writes them to the Data Lake into the relevant incremental location. It then drops the existing PreviousDay dataset and promotes the staging dataset to PreviousDay so it is ready for the next incremental run.
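
A sketch of that compare-and-promote flow, rendered in PySpark on Databricks rather than the SQL staging database, might look like the following; the JDBC connection details, the `staging.customers` / `previousday.customers` table names, and the lake paths are all assumed for illustration.

```python
# Illustrative PySpark rendering of the compare-and-promote flow.
# Connection details, table names, and paths are assumptions, not the
# actual objects from the pipeline described above.
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://source-host:1433;databaseName=SourceDb"  # placeholder
jdbc_props = {
    "user": "reader",          # placeholder read-only credentials
    "password": "***",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}
run_date = date.today().isoformat()

# 1. Full extract from the source into the staging schema.
source = spark.read.jdbc(url=jdbc_url, table="dbo.Customers", properties=jdbc_props)
source.write.mode("overwrite").saveAsTable("staging.customers")

# 2. Compare staging with the previous day's copy: rows present today that
#    do not appear identically in yesterday's snapshot (new or changed).
staging = spark.table("staging.customers")
previous = spark.table("previousday.customers")
delta = staging.exceptAll(previous)

# 3. Write the delta to the relevant incremental location in the lake.
delta.write.mode("overwrite").parquet(f"/mnt/datalake/incremental/{run_date}")

# 4. Drop the old PreviousDay copy and promote staging so the next run
#    compares against today's snapshot.
spark.sql("DROP TABLE IF EXISTS previousday.customers")
spark.sql("CREATE TABLE previousday.customers AS SELECT * FROM staging.customers")
```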

Sreedhar