1

One of our clients has a requirement to build/develop data quality rules using hiveQL. E.g, Replace NULL values, Change date format in YYYY-MM-DD, Standardize amount column values in US & EU format, etc.

Problem Statement:

I have the set of data quality rules in one hive table(dq_rules), want to execute each rule one by one and store the errors(the data issues such as null column, incorrect date format column) in another hive table(dq_logging) for reporting/logging purpose.

Please Suggest me solution by keeping one thing in mind that, I want to make this solution generic and executable for any hive table/columns(It means it should be parameterized).

Restriction: I cannot use existing Data Quality tools. I need to complete it using a hive only(Restriction is given by Client).

Schema for Tables:

  1. dq_rules => Validation Rule ID,Rule Category,DQ Dimension,Rule Description Date Added,Date Retired
  2. dq_logging => Error_ID,Source_Name,Erroneous_Source_Fields,Source_File_Record,Validation Rule ID

If anyone has a solution related to writing shell/python script that will also work for me. I just need to make it end to end process.

A. Nadjar
  • 2,440
  • 2
  • 19
  • 20
Manoj Dhake
  • 227
  • 4
  • 16

0 Answers0