Is there any way to define data quality rules that can be applied to DataFrames? The template for defining a rule should be simple enough for a layperson to use; we can then take these rules, convert them to PySpark code, and run them over the data.
I was thinking along the lines below.
ID | ProjectID | RuleID | Attribute1 | Value1        | Condition1 | Attribute2 | Value2        | Condition2 | Type   | ModifyAttribute | ModificationLogic     | CustomUDF
1  | 1         | 1      | SerialNum  | 6             | EQUAL      |            |               |            | MODIFY | SerialNum       | SUBSTR(serialNum,1,6) |
2  | 1         | 2      | DriverName | ['A','B','C'] | VALUEMATCH | Source     | ['D','E','F'] | IN         | REJECT |                 |                       |
If there is any tool or domain-specific language for defining such rules, that would help. A template for rules that work across attributes and across multiple tables (e.g., a join against a country lookup) would also be helpful.
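For the translation step, here is a minimal sketch of how a rule row from the table above could be compiled into Spark SQL expression strings that a PySpark runner would then apply via `F.expr` / `df.filter`. The function names and the reading of the conditions are my own assumptions (in particular, EQUAL with a numeric value is interpreted as a length check), not a fixed specification:

```python
def compile_condition(attribute, value, condition):
    """Translate one (Attribute, Value, Condition) triple into a SQL predicate.

    Assumptions: EQUAL with a numeric value checks the attribute's length;
    VALUEMATCH and IN both test membership in a list of literals.
    """
    if condition == "EQUAL":
        return f"length({attribute}) = {value}"
    if condition in ("VALUEMATCH", "IN"):
        quoted = ", ".join(f"'{v}'" for v in value)
        return f"{attribute} IN ({quoted})"
    raise ValueError(f"Unknown condition: {condition}")


def compile_rule(rule):
    """Turn one rule row (a dict keyed by the template columns) into an
    action description that a PySpark runner could execute."""
    predicates = [compile_condition(rule["Attribute1"], rule["Value1"], rule["Condition1"])]
    if rule.get("Attribute2"):
        predicates.append(compile_condition(rule["Attribute2"], rule["Value2"], rule["Condition2"]))
    where = " AND ".join(predicates)

    if rule["Type"] == "MODIFY":
        # Runner would do, e.g.:
        # df.withColumn(col, F.expr(f"CASE WHEN {where} THEN {expr} ELSE {col} END"))
        return {"action": "modify", "where": where,
                "column": rule["ModifyAttribute"], "expr": rule["ModificationLogic"]}
    if rule["Type"] == "REJECT":
        # Runner would do, e.g.: df.filter(f"NOT ({where})")
        return {"action": "reject", "where": where}
    raise ValueError(f"Unknown rule type: {rule['Type']}")


# Rule 2 from the table compiles to a reject action with the predicate
# "DriverName IN ('A', 'B', 'C') AND Source IN ('D', 'E', 'F')":
rule2 = {"Attribute1": "DriverName", "Value1": ["A", "B", "C"], "Condition1": "VALUEMATCH",
         "Attribute2": "Source", "Value2": ["D", "E", "F"], "Condition2": "IN",
         "Type": "REJECT"}
print(compile_rule(rule2))
```

Keeping the compiler separate from the Spark runner means the rule template stays plain data that non-programmers can edit, while the generated expressions remain ordinary Spark SQL.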