We are implementing few business flows in financial area. The requirement (unfortunately, not very specific) from the regulatory is to have a data lineage for auditing purpose.
The flow contains 2 parts: synchronous and asynchronous. The syncronous part is a payment attempt containing bunch of info about point of sale, the customer and the goods. The asynchronous part is a batch process that feeds the credit assessment data model with a newly-calculated portion of variables on an hourly basis. The variables might include some aggregations like balances and links to historical transactions.
For calculating the asynchronous part we ingest the data from multiple relational DBs and store them in HDFS in a raw format (rows from tables in csv format).
When storing the data on HDFS is done a job based on Spring XD that calculates some aggregations and produces the data for the synchronous part is triggered.
We have relational data, raw data on HDFS and MapReduce jobs relying on POJOs that describe the relevant semantics and the transformations implemented in SpringXD.
So, the question is how to handle the auditing in the scenario described above? We need at any point in time to be able to explain why a specific decision was made and also be able to explain how each and every variable used in the policy (synchronous or near-real-time flow) was calculated.
I looked at existing Hadoop stack and it looks like currently no tool could provide with a good enterprise-ready auditing capabilities.
My thinking is to start with custome implementation that includes>
- A business glossary with all the business terms
- Operational and technical metadata - logging transformation execution for each entry into a separate store.
- log changes to a business logic (use data from version control where the business rules and transformations are kept).
Any advice or sharing your experience would be greatly appreciated!