How to plug a process for identifying sensitive information into an ETL pipeline?

Question

Hope you are doing well!

We have already developed an ETL pipeline using Apache NiFi. It is triggered only when a client uploads a source data file from the portal. After that, the data in the source file passes through various layers, is transformed, and is stored in the warehouse (i.e. Hive).

Goal: identify sensitive information and mask it so that end users do not see the actual data.

Identifying sensitive data and masking strategy: we plan to use open source tools to achieve this goal, as follows.

Data Steward Studio (DSS): this tool allows me to identify sensitive information and tag it properly.
Apache Atlas: once a data steward has confirmed a tag, that tag is pushed to Apache Atlas.
Apache Ranger: finally, we can define tag-based masking policies in Apache Ranger, which allow or deny access for specific users.

For more details on the above solution, please see this link:

https://www.youtube.com/watch?v=RzEfLwJaLsc

Problem: to feed data to DSS, the data must first be loaded into a Hive table. That is fine, but we cannot stop the existing ETL flow midway and then start identifying sensitive information. The above solution also requires some manual steps, which I want to get rid of by automating them, that is, by plugging the identification step in somewhere within the NiFi pipeline. As far as I understand, DSS does not allow us to do that.

Manual process:

Create an asset collection.
Accept/reject suggested tags within DSS.

If we cannot plug the identification process into the pipeline, the client's sensitive data will be exposed and visible to everyone on the team. I want a way to de-identify sensitive data before it is actually loaded into HDFS or Hive tables.

If anyone has already worked in this area, please share your response.

score 0 · Answer 1 · answered Jul 02 '20 at 06:46

I have not tested this, but here are my thoughts on this challenge.

1. Set up the system so that data is NOT visible to everyone (or anyone) by default
2. Load the data into Hive
3. Let the profilers run and accept their suggestions
4. Open up the data to those who should have access (except for the things found by the profiler)

There are still some implementation details to work out (e.g. how to automate steps 3 and 4, and whether you can solve this with tags alone or whether the data needs to sit in a staging area first). But I hope this steers you in a good direction.
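To make the "hidden by default, then open up" idea a bit more concrete, here is a rough sketch in Python of how the decision in the last two steps could be automated. Everything in it is an assumption for illustration: `TagSuggestion`, `plan_release`, the tag names and the confidence threshold are hypothetical and are not part of DSS, Atlas or Ranger. It only shows the decision logic, not how to fetch suggestions from a profiler or how to push policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagSuggestion:
    """One profiler suggestion (hypothetical shape, not a DSS API object)."""
    column: str
    tag: str           # e.g. "EMAIL" (illustrative tag name)
    confidence: float  # assumed to be in the range 0.0 to 1.0


def plan_release(all_columns, suggestions, threshold=0.8):
    """Decide what happens to each column once profiling has finished.

    masked:       best suggestion is at or above the threshold (auto-accepted)
    needs_review: only weaker suggestions exist, so the column stays hidden
                  until a steward decides
    open:         the profiler suggested nothing for this column
    """
    # Keep the highest confidence seen for each column.
    best = {}
    for s in suggestions:
        best[s.column] = max(best.get(s.column, 0.0), s.confidence)

    plan = {"open": [], "masked": [], "needs_review": []}
    for col in all_columns:
        if col not in best:
            plan["open"].append(col)
        elif best[col] >= threshold:
            plan["masked"].append(col)
        else:
            plan["needs_review"].append(col)
    return plan


if __name__ == "__main__":
    columns = ["id", "email", "notes"]
    found = [
        TagSuggestion("email", "EMAIL", 0.95),
        TagSuggestion("notes", "NAME", 0.40),
    ]
    print(plan_release(columns, found))
```

The point of the design is that nothing becomes visible as a side effect of loading: a column is only opened once the plan explicitly says so, and anything uncertain stays hidden.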

Thanks for your reply @Dennis. Yes, you are right: the challenge is how to automate steps 3 and 4 in the ETL pipeline, because DSS does not provide any functionality like that. Also, what do you mean by setting up the system? Could you please elaborate on that? - Manoj Dhake, Jul 03 '20 at 05:07

score 0 · Answer 2 · answered Aug 24 '20 at 06:06

One idea might be to use NiFi's EncryptContent processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EncryptContent/). The values loaded into Hive would then be encrypted from the start and would not be visible to the stewards. Once the tagging has been done, a subsequent part of the pipeline (where I assume you are also using NiFi) can decrypt the content as required.
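As far as I know, EncryptContent works on the content of a whole flow file rather than on individual fields. If you need something field-level and reversible, a related idea is tokenization. Below is a minimal sketch in plain Python, and to be clear it is a different technique from EncryptContent, not a reimplementation of it. The patterns, the `TokenVault` class and `protect_record` are hypothetical names made up for illustration; the in-memory vault stands in for a secured store, and real detection would need far better rules than two regular expressions.

```python
import re
import secrets

# Illustrative patterns only (assumption); real detection needs more care.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


class TokenVault:
    """Maps sensitive values to random tokens and back (in memory only)."""

    def __init__(self):
        self._by_value = {}
        self._by_token = {}

    def tokenize(self, kind, value):
        # Reuse the token for a value already seen, so joins still work.
        key = (kind, value)
        if key not in self._by_value:
            token = f"{kind}_{secrets.token_hex(8)}"
            self._by_value[key] = token
            self._by_token[token] = value
        return self._by_value[key]

    def detokenize(self, token):
        # Raises KeyError for unknown tokens.
        return self._by_token[token]


def protect_record(record, vault):
    """Return a copy of the record with matching substrings tokenized.

    Only string values are scanned; other values pass through unchanged.
    """
    protected = {}
    for field, value in record.items():
        if isinstance(value, str):
            for kind, pattern in PATTERNS.items():
                value = pattern.sub(
                    lambda m, k=kind: vault.tokenize(k, m.group(0)), value
                )
        protected[field] = value
    return protected


if __name__ == "__main__":
    vault = TokenVault()
    row = {"id": 7, "contact": "mail bob@example.com", "ssn": "123-45-6789"}
    safe = protect_record(row, vault)
    print(safe)                           # tokens instead of raw values
    print(vault.detokenize(safe["ssn"]))  # original value, for authorized use
```

The trade-off compared with encrypting everything is that non-sensitive fields stay readable for the stewards, but anything the patterns miss is loaded in clear text.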
