How to implement Data Lineage on Hadoop?

Question

We are implementing a few business flows in the financial domain. The requirement from the regulator (unfortunately not very specific) is to have data lineage for auditing purposes.

The flow has two parts: synchronous and asynchronous. The synchronous part is a payment attempt containing information about the point of sale, the customer and the goods. The asynchronous part is a batch process that feeds the credit assessment data model with a newly calculated set of variables on an hourly basis. The variables might include aggregations such as balances and links to historical transactions.

To calculate the asynchronous part, we ingest data from multiple relational DBs and store it in HDFS in raw format (rows from tables in CSV format).

Once the data has been stored on HDFS, a job based on Spring XD is triggered; it calculates some aggregations and produces the data for the synchronous part.

We have relational data, raw data on HDFS, and MapReduce jobs relying on POJOs that describe the relevant semantics and the transformations implemented in Spring XD.

So the question is: how should auditing be handled in the scenario described above? At any point in time we need to be able to explain why a specific decision was made, and how each and every variable used in the policy (synchronous or near-real-time flow) was calculated.

I looked at the existing Hadoop stack, and it looks like no current tool provides good enterprise-ready auditing capabilities.

My thinking is to start with a custom implementation that includes:

A business glossary with all the business terms.
Operational and technical metadata: logging transformation execution for each entry into a separate store.
A log of changes to the business logic (using data from version control, where the business rules and transformations are kept).
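
To make the three items above more concrete, here is a minimal sketch of what one audit entry and the separate store might look like. It is written in Python only for brevity; the same shape could be a Java POJO. Everything in it is an assumption made for illustration: the names LineageRecord and LineageStore, the field names, the example variable names and the in-memory dictionary (a real store would be a durable, append-only table). It is not taken from any existing Hadoop tool.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass(frozen=True)
class LineageRecord:
    """One audit entry: which inputs and which code version produced a variable."""
    variable: str          # business glossary term (hypothetical name)
    value: float           # the computed value that the policy consumed
    transformation: str    # name of the job step that computed it
    code_version: str      # commit id taken from version control
    inputs: List[str]      # ids of upstream records or raw source rows
    computed_at: str       # ISO-8601 timestamp supplied by the caller

    def record_id(self) -> str:
        # Content hash, so the same computation always yields the same id.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


class LineageStore:
    """In-memory stand-in for the separate, append-only audit store."""

    def __init__(self) -> None:
        self._records: Dict[str, LineageRecord] = {}

    def append(self, record: LineageRecord) -> str:
        rid = record.record_id()
        self._records.setdefault(rid, record)  # never overwrite an entry
        return rid

    def explain(self, rid: str, depth: int = 0) -> List[str]:
        """Walk upstream from a record and return an indented explanation."""
        record = self._records.get(rid)
        if record is None:
            # Not a stored record, so treat the id as a raw source row.
            return ["  " * depth + f"raw input {rid}"]
        lines = [
            "  " * depth
            + f"{record.variable}={record.value} via "
            + f"{record.transformation}@{record.code_version}"
        ]
        for upstream in record.inputs:
            lines.extend(self.explain(upstream, depth + 1))
        return lines


store = LineageStore()
balance_id = store.append(LineageRecord(
    variable="avg_balance_30d", value=1250.0,
    transformation="aggregate_balances", code_version="3f2a9c1",
    inputs=["accounts.csv#row=42", "accounts.csv#row=43"],
    computed_at="2016-05-30T10:00:00Z"))
score_id = store.append(LineageRecord(
    variable="credit_score", value=0.82,
    transformation="score_model", code_version="3f2a9c1",
    inputs=[balance_id], computed_at="2016-05-30T10:05:00Z"))
print("\n".join(store.explain(score_id)))
```

The idea is that each hourly batch run would append one record per computed variable, and the synchronous flow would store the record ids of the variables it used alongside the decision. Explaining a decision then becomes a walk from those ids back to the raw rows on HDFS, with the commit id telling the auditor which version of the business rules was in effect.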

Any advice or sharing your experience would be greatly appreciated!

please read the tag text for enterprise-architect before re-rollbacking. It refers to the UML modelling tool from Sparx Systems, not to the architecture role. If this question is somehow related to the UML tool, please explain how. — Uffe, May 31 '16 at 07:22
@Uffe sorry got confused here. thought it was enterprise-architecture. BTW it was just a rollback not a re-rollback :) — aviad, May 31 '16 at 14:53

score 0 · Answer 1 · answered May 02 '19 at 23:01

Currently Cloudera sets the industry standard for Data Lineage/Data Governance in the big data space.

Glossary, metadata and historically run (versions of) queries can all be facilitated.

I do realize some of this may not have been in place when you asked the question, but it certainly is now.

Disclaimer: I am an employee of Cloudera.
