Here is a simplified scenario:
There are N business flows that need the same raw data from the same source. The data is ingested using Kafka (ordinary Kafka pipelines) and landed on HDFS, where an automated quality-check pipeline is triggered on the raw data for every flow. The N flows may have different quality standards for the data. For example, they might require different date and time formats when the raw data is transformed into their desired schemas.
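To make the setup concrete, here is a minimal sketch of what the per-flow checks could look like. Everything in it (flow names, field names, date formats, function names) is hypothetical, not part of the actual pipeline:

```python
from datetime import datetime

# Hypothetical per-flow quality rules: every business flow validates the same
# landed raw records against its own expectations (here, only a date format).
FLOW_RULES = {
    "flow_a": {"date_field": "event_ts", "date_format": "%Y-%m-%d %H:%M:%S"},
    "flow_b": {"date_field": "event_ts", "date_format": "%d/%m/%Y"},
}

def record_passes(record: dict, rules: dict) -> bool:
    """Return True if a single raw record satisfies one flow's quality rules."""
    try:
        datetime.strptime(record[rules["date_field"]], rules["date_format"])
        return True
    except (KeyError, ValueError, TypeError):
        return False

def run_checks(records: list[dict]) -> dict:
    """Run every flow's checks against the same landed data set.

    Returns the pass rate per flow, which is then compared to that flow's KPI.
    """
    return {
        flow: sum(record_passes(r, rules) for r in records) / len(records)
        for flow, rules in FLOW_RULES.items()
    }
```

The point of the sketch is only that the same raw data set is checked N times against N different rule sets, so one flow can fail its KPI while the others pass.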
What is the best approach for handling a failure to meet the quality-check KPIs of a business flow?
The options are:
- Fail all: notify the source data provider and wait for the fixed data, then re-ingest and re-run all N sets of quality checks.
- Create a branch: the K out of N business flows that did not pass the quality checks wait for their fixed data set, while the N-K flows that passed continue working with the current data set.
- Flag the entries that did not pass the quality checks for certain business flows and put them in a special queue to be handled/fixed manually. Apply rules and thresholds on the number of bad entries, based on the capacity of the team that has to go through the queue and analyze and fix the problematic entries (a rough sketch of this is below the list).
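For illustration, here is a minimal sketch of how the third option might work, assuming a per-flow quality predicate and a hypothetical capacity threshold (both made up for this example):

```python
from typing import Callable

def partition_records(
    records: list[dict],
    passes: Callable[[dict], bool],
    max_flagged: int = 1000,
) -> tuple[list[dict], list[dict]]:
    """Split one flow's records into usable ones and ones flagged for manual review.

    `passes` is that flow's quality predicate; `max_flagged` is the capacity
    threshold, i.e. the most records the team can realistically fix by hand.
    """
    good, flagged = [], []
    for record in records:
        (good if passes(record) else flagged).append(record)
    if len(flagged) > max_flagged:
        # Too many bad entries for manual handling: escalate instead,
        # e.g. fall back to option 1 or 2 and wait for fixed source data.
        raise RuntimeError(
            f"{len(flagged)} flagged records exceed review capacity of {max_flagged}"
        )
    return good, flagged
```

With this, a flow that fails its KPI could still proceed on the good subset while the flagged subset sits in the manual-review queue.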
Which of the above approaches (if any) is the most sensible? Are there any patterns or best practices for handling the situation where the same data is used by many consumers with different quality standards? Ideally, I would avoid duplicating the data, meaning re-ingesting a fixed data set for every consumer. (N is not even the worst case, because a fix for one of the N flows might break others that were fine beforehand, so in theory this process could be endless.)