One of our clients has a requirement to build/develop data quality rules using hiveQL.
E.g, Replace NULL values, Change date format in YYYY-MM-DD
, Standardize amount column values in US & EU format, etc.
Problem Statement:
I have the set of data quality rules in one hive table(dq_rules), want to execute each rule one by one and store the errors(the data issues such as null column, incorrect date format column) in another hive table(dq_logging) for reporting/logging purpose.
Please Suggest me solution by keeping one thing in mind that, I want to make this solution generic and executable for any hive table/columns(It means it should be parameterized).
Restriction: I cannot use existing Data Quality tools. I need to complete it using a hive only(Restriction is given by Client).
Schema for Tables:
- dq_rules => Validation Rule ID,Rule Category,DQ Dimension,Rule Description Date Added,Date Retired
- dq_logging => Error_ID,Source_Name,Erroneous_Source_Fields,Source_File_Record,Validation Rule ID
If anyone has a solution related to writing shell/python script that will also work for me. I just need to make it end to end process.