Here is a simplified scenario:
There are N business flows that need the same raw data from the same source. The data is ingested using Kafka (ordinary Kafka pipelines) and landed on HDFS, where an automated quality-check pipeline is triggered on the raw data for every flow. The N flows may have different quality standards for the data. For example, they might require different date and time formats when the raw data is transformed into their desired schemas.
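To make the setup concrete, here is a minimal sketch of what the per-flow checks could look like. Everything in it (flow names, field names, date formats, function names) is hypothetical, not part of the actual pipeline:

```python
from datetime import datetime

# Hypothetical per-flow quality rules: every business flow validates the same
# landed raw records against its own expectations (here, only a date format).
FLOW_RULES = {
    "flow_a": {"date_field": "event_ts", "date_format": "%Y-%m-%d %H:%M:%S"},
    "flow_b": {"date_field": "event_ts", "date_format": "%d/%m/%Y"},
}

def record_passes(record: dict, rules: dict) -> bool:
    """Return True if a single raw record satisfies one flow's quality rules."""
    try:
        datetime.strptime(record[rules["date_field"]], rules["date_format"])
        return True
    except (KeyError, ValueError, TypeError):
        return False

def run_checks(records: list[dict]) -> dict:
    """Run every flow's checks against the same landed data set.

    Returns the pass rate per flow, which is then compared to that flow's KPI.
    """
    return {
        flow: sum(record_passes(r, rules) for r in records) / len(records)
        for flow, rules in FLOW_RULES.items()
    }
```

The point of the sketch is only that the same raw data set is checked N times against N different rule sets, so one flow can fail its KPI while the others pass.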
What is the best approach for handling a failure to meet the quality-check KPIs of a business flow?
The options are:
- Fail all: notify the source data provider and wait for the fixed data, then re-ingest and re-run all N sets of quality checks.
- Create a branch: the K out of N business flows that did not pass the quality checks wait for their fixed data set, while the N-K flows that passed continue working with the current data set.
- Flag the entries that did not pass the quality checks for certain business flows and put them in a special queue to be handled/fixed manually. Apply rules and thresholds on the number of bad entries, based on the capacity of the team that has to go through the queue and analyze and fix the problematic entries (a rough sketch of this is below the list).
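For illustration, here is a minimal sketch of how the third option might work, assuming a per-flow quality predicate and a hypothetical capacity threshold (both made up for this example):

```python
from typing import Callable

def partition_records(
    records: list[dict],
    passes: Callable[[dict], bool],
    max_flagged: int = 1000,
) -> tuple[list[dict], list[dict]]:
    """Split one flow's records into usable ones and ones flagged for manual review.

    `passes` is that flow's quality predicate; `max_flagged` is the capacity
    threshold, i.e. the most records the team can realistically fix by hand.
    """
    good, flagged = [], []
    for record in records:
        (good if passes(record) else flagged).append(record)
    if len(flagged) > max_flagged:
        # Too many bad entries for manual handling: escalate instead,
        # e.g. fall back to option 1 or 2 and wait for fixed source data.
        raise RuntimeError(
            f"{len(flagged)} flagged records exceed review capacity of {max_flagged}"
        )
    return good, flagged
```

With this, a flow that fails its KPI could still proceed on the good subset while the flagged subset sits in the manual-review queue.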
Which of the above approaches (if any) is the most sensible? Are there any patterns or best practices for handling the situation where the same data is used by many consumers with different quality standards? Ideally, I would avoid duplicating the data, meaning re-ingesting a fixed data set for every consumer. (N is not even the worst case, because a fix for one of the N flows might break others that were fine beforehand, so in theory this process could be endless.)