Is there any way to define data quality rules that can be applied to DataFrames? The template for defining a rule should be simple enough for a layperson to use; we can then take these rules, convert them to PySpark code, and run them over the data.
I was thinking along the lines below.
ID | ProjectID | RuleID | Attribute1 | Value1        | Condition1 | Attribute2 | Value2        | Condition2 | Type   | ModifyAttribute | ModificationLogic     | CustomUDF
1  | 1         | 1      | SerialNum  | 6             | EQUAL      |            |               |            | MODIFY | SerialNum       | SUBSTR(serialNum,1,6) |
2  | 1         | 2      | DriverName | ['A','B','C'] | VALUEMATCH | Source     | ['D','E','F'] | IN         | REJECT |                 |                       |
If there is any tool or domain-specific language for defining such rules, that would help. A template for rules that work across attributes and across multiple tables (e.g., a join against a country lookup) would also be helpful.
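For the translation step, here is a minimal sketch of how a rule row from the table above could be compiled into Spark SQL expression strings that a PySpark runner would then apply via `F.expr` / `df.filter`. The function names and the reading of the conditions are my own assumptions (in particular, EQUAL with a numeric value is interpreted as a length check), not a fixed specification:

```python
def compile_condition(attribute, value, condition):
    """Translate one (Attribute, Value, Condition) triple into a SQL predicate.

    Assumptions: EQUAL with a numeric value checks the attribute's length;
    VALUEMATCH and IN both test membership in a list of literals.
    """
    if condition == "EQUAL":
        return f"length({attribute}) = {value}"
    if condition in ("VALUEMATCH", "IN"):
        quoted = ", ".join(f"'{v}'" for v in value)
        return f"{attribute} IN ({quoted})"
    raise ValueError(f"Unknown condition: {condition}")


def compile_rule(rule):
    """Turn one rule row (a dict keyed by the template columns) into an
    action description that a PySpark runner could execute."""
    predicates = [compile_condition(rule["Attribute1"], rule["Value1"], rule["Condition1"])]
    if rule.get("Attribute2"):
        predicates.append(compile_condition(rule["Attribute2"], rule["Value2"], rule["Condition2"]))
    where = " AND ".join(predicates)

    if rule["Type"] == "MODIFY":
        # Runner would do, e.g.:
        # df.withColumn(col, F.expr(f"CASE WHEN {where} THEN {expr} ELSE {col} END"))
        return {"action": "modify", "where": where,
                "column": rule["ModifyAttribute"], "expr": rule["ModificationLogic"]}
    if rule["Type"] == "REJECT":
        # Runner would do, e.g.: df.filter(f"NOT ({where})")
        return {"action": "reject", "where": where}
    raise ValueError(f"Unknown rule type: {rule['Type']}")


# Rule 2 from the table compiles to a reject action with the predicate
# "DriverName IN ('A', 'B', 'C') AND Source IN ('D', 'E', 'F')":
rule2 = {"Attribute1": "DriverName", "Value1": ["A", "B", "C"], "Condition1": "VALUEMATCH",
         "Attribute2": "Source", "Value2": ["D", "E", "F"], "Condition2": "IN",
         "Type": "REJECT"}
print(compile_rule(rule2))
```

Keeping the compiler separate from the Spark runner means the rule template stays plain data that non-programmers can edit, while the generated expressions remain ordinary Spark SQL.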