Hope you are doing well!
We have already developed an ETL pipeline using Apache NiFi. It is triggered only when a client uploads a source data file from the portal. The data in the source file then passes through various layers, is transformed, and is stored in the warehouse (i.e. Hive).
Goal: To identify sensitive information and mask it so that end users won't see the actual data.
Identifying sensitive data and masking strategy: We plan to use the following open source tools to achieve this goal.
Data Steward Studio (DSS): This tool allows me to identify sensitive information and tag it properly.
Apache Atlas: Once the data steward has confirmed a tag, that tag is pushed into Apache Atlas.
Apache Ranger: Finally, we can define a tag-based masking policy in Apache Ranger, which allows or denies access for specific users.
For more details on the above solution, please see this link:
https://www.youtube.com/watch?v=RzEfLwJaLsc
Problem: In order to feed the data to the DSS tool, it must first be loaded into a Hive table. That is fine. But we cannot stop the existing ETL flow midway and then start the identification of sensitive information. The above solution requires some manual steps, which I want to get rid of and automate. That is, the identification should be plugged in somewhere within the NiFi pipeline. But so far, as per my understanding, DSS does not allow us to do something like that.
Manual process:
Create an asset collection.
Accept/reject the suggested tags within DSS.
If we cannot plug the identification process into the pipeline, the client's sensitive data will be exposed and visible to everyone on the team. I want something that lets us de-identify sensitive data before it actually gets loaded into HDFS or Hive tables.
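To make the idea concrete, here is a rough sketch of the kind of in-pipeline masking step I have in mind. It is only an illustration, not something we have built: the regex patterns, the function names (mask_value, mask_csv) and the assumption that the source file is a CSV with a header row are all hypothetical, and a real rule set would have to come from the data steward.

```python
import csv
import io
import re

# Hypothetical detection rules (assumption): simple regexes for two kinds
# of sensitive values. Real rules would be agreed with the data steward.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask_value(value: str) -> str:
    """Replace each match of a known pattern with asterisks of equal length."""
    for pattern in PATTERNS.values():
        value = pattern.sub(lambda m: "*" * len(m.group()), value)
    return value


def mask_csv(text: str) -> str:
    """Mask every data cell of a CSV document, leaving the header row as-is."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader, None)
    if header is not None:
        writer.writerow(header)
    for row in reader:
        writer.writerow([mask_value(cell) for cell in row])
    return out.getvalue()
```

Something like this could probably be called from NiFi with a processor such as ExecuteStreamCommand, which passes the flow file content to a script on stdin and takes the script's stdout as the new content, so the masked file is what continues on to HDFS/Hive. Pattern matching like this will miss anything the regexes do not cover, which is why I would still prefer an automated equivalent of the DSS tagging step.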
If anyone has already worked in this particular area, please share how you approached this problem.