
Can someone help me define a data quality framework for analyzing data? I'm just looking for a high-level description of what it is supposed to do, and your thoughts on it.

Brian Tompsett - 汤莱恩

2 Answers


The six measures I consider for data quality are: Accuracy, Completeness, Integrity, Precision, Validity and Timeliness. I recently completed an article with some high-level testing approaches here: https://www.eageranalyst.com/blog/2019/2/27/data-quality-beyond-accuracy-and-completeness

Accuracy

Find an external trusted data source to validate against; if one is not available, find a proxy that you can compare against. For example, you might not be able to compare individual transactions against another source, but you may be able to compare them to an aggregate total or end-of-day values.
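To make the aggregate comparison concrete, here is a minimal pandas sketch; the column names (`txn_date`, `amount`, `eod_total`) are invented for the example and would come from your own data and trusted source.

```python
import pandas as pd

# Hypothetical transaction-level data and a trusted end-of-day summary.
transactions = pd.DataFrame({
    "txn_date": ["2019-03-01", "2019-03-01", "2019-03-02"],
    "amount": [100.0, 250.0, 75.0],
})
end_of_day = pd.DataFrame({
    "txn_date": ["2019-03-01", "2019-03-02"],
    "eod_total": [350.0, 80.0],
})

# Aggregate our data to the same grain as the trusted source, then compare.
daily = transactions.groupby("txn_date", as_index=False)["amount"].sum()
check = daily.merge(end_of_day, on="txn_date", how="outer")
check["difference"] = check["amount"] - check["eod_total"]

# Any non-zero difference is a candidate accuracy issue to investigate.
print(check[check["difference"] != 0])
```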

Completeness

If you have a unique ID or reference key, check if you have more than one record for that key to identify duplicate values.
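For example, assuming a pandas DataFrame where `order_id` is supposed to be unique, the duplicated keys can be pulled out like this:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "value": [10.0, 20.0, 20.0, 5.0],
})

# keep=False flags every row sharing a key, not just the later copies.
duplicates = orders[orders.duplicated(subset=["order_id"], keep=False)]
print(duplicates)
```

The same `subset` argument accepts several columns, so it also covers the composite-attribute duplicate checks described further down (e.g. email paired with date of birth).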

If you have an external source of truth, reconcile your ID/reference key against that source to identify missing values or extra ones.
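If both key lists can be loaded, that reconciliation is just a set comparison (the keys below are made up for illustration):

```python
# Keys from your dataset versus keys from the source of truth.
our_keys = {"A1", "A2", "A4"}
source_of_truth_keys = {"A1", "A2", "A3"}

missing = source_of_truth_keys - our_keys   # present upstream, absent here
extra = our_keys - source_of_truth_keys     # present here, unknown upstream

print("Missing records:", sorted(missing))
print("Unexpected records:", sorted(extra))
```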

If you have a checksum or control total, confirm that your dataset has that many records or computes to the same checksum.

For column completeness, identify which columns should not be missing values and analyze any records that are missing data in those columns.
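One way to profile that, assuming you have already decided which columns are mandatory:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@example.com", None, "c@example.com"],
    "signup_date": ["2019-01-05", "2019-02-11", None],
})
mandatory_columns = ["customer_id", "email", "signup_date"]

# Count missing values per mandatory column...
print(customers[mandatory_columns].isna().sum())

# ...and pull out the offending records for review.
print(customers[customers[mandatory_columns].isna().any(axis=1)])
```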

If you don’t have a reference number, consider other measures you could use as a proxy to indicate whether your data is over- or under-reported. For example, the volume of transactions per day, or the aggregate value of transactions in a month or per user.

If you don’t have a reference number, consider what combination of attributes you could use to identify duplicates in your data. For example, email address, physical address or phone number paired with a name or date of birth.

Integrity

Try to obtain a data dictionary for any data provided so that there is a clear explanation of what each column is expected to contain. Compare these definitions across sources.

Where you have two sources, compare attributes between the sources to identify potential omissions or inaccuracies and define which you consider to be the authoritative source of that data.
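For example, assuming two sources share a `customer_id` key and both hold a `postcode` attribute, a merge makes the discrepancies visible:

```python
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "postcode": ["SW1A 1AA", "M1 1AE", "EH1 1YZ"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "postcode": ["SW1A 1AA", "M2 5BQ", "EH1 1YZ"],
})

merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))
mismatches = merged[merged["postcode_crm"] != merged["postcode_billing"]]

# These records need a decision on which source is authoritative.
print(mismatches)
```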

Precision

Examine the values you have: do they cluster around specific values, which may indicate they are being rounded to a less granular level?
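A quick way to spot that clustering, assuming a numeric amount column and a suspected rounding unit of 0.50:

```python
import pandas as pd

amounts = pd.Series([10.00, 12.50, 15.00, 17.50, 20.00, 9.99])

# Work in integer cents to avoid floating-point noise, then check how many
# values land exactly on the suspected rounding boundary.
cents = (amounts * 100).round().astype(int)
on_boundary = cents % 50 == 0
print(f"{on_boundary.mean():.0%} of values fall on a 0.50 boundary")
```

If the vast majority of values sit on the boundary, the data has probably been captured or rounded at that granularity.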

Be clear about the reasons why you have a particular level of precision for a recorded value. What level of precision do you need for your calculations?

Validity

Where you’re expecting categorical values, check that your data falls within those categories.

If you require certain fields to be present, check they are populated.

For free text fields, analyze field lengths to look for instances where dummy data has been recorded.
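A sketch covering the categorical and free-text checks; the allowed statuses and the length threshold are assumptions for the example:

```python
import pandas as pd

records = pd.DataFrame({
    "status": ["open", "closed", "pending", "clsed"],
    "comments": ["Customer called to query invoice", "x", "n/a", "Refund issued"],
})

# Categorical check: anything outside the allowed set is invalid.
allowed_statuses = {"open", "closed", "pending"}
print(records[~records["status"].isin(allowed_statuses)])

# Free-text check: suspiciously short entries often indicate dummy data.
print(records[records["comments"].str.len() <= 3])
```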

Timeliness

Look for expiration dates or last updated dates on key information fields. Things like contact information will inevitably change over time.

Check when a report was last refreshed or prepared.
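For example, assuming a `last_updated` column and a twelve-month freshness tolerance (the threshold is whatever your use case can accept):

```python
import pandas as pd

contacts = pd.DataFrame({
    "contact_id": [1, 2],
    "last_updated": pd.to_datetime(["2017-06-30", "2019-02-15"]),
})

# Flag anything not refreshed within the chosen window.
cutoff = pd.Timestamp.today() - pd.DateOffset(years=1)
print(contacts[contacts["last_updated"] < cutoff])
```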

eageranalyst

The answer depends on the context you're interested in (e.g. to support a particular domain or use a particular technology), but hopefully these high-level notes are of some use.

Given that decisions made on data assume a certain level of quality in that data (for example, strategic decisions made using aggregate "Management Information", or automated processing), a data quality framework may look to assess that data against a number of measures, so that the level of confidence in the quality of the data is understood.

In practical terms, each data attribute can be assessed for measurable things:

  1. if it is mandatory, is it populated;
  2. if it is represented by a specific data type (e.g. date, number), does it match that definition;
  3. does the attribute meet business rules
    • simple rules, like maximum/minimum values;
    • more complex rules, like limits on day-to-day changes in value;
    • aggregation rules across a set of values;
  4. cross-attribute rules, where values depend on other attributes' values;
  5. if it is stored in multiple places, is it consistent;
  6. is it the right value.

Items 1-5 could be automated (e.g. using SQL if you're using databases, or a bespoke DQ package). The last item may rely on sampling (as you may pass all the rules but just have the wrong value).
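As a loose illustration of how items 1-4 could be automated outside a database, each rule can be written as a predicate and evaluated in bulk; the field names and thresholds below are invented for the example, and in a database the same checks would typically be SQL queries.

```python
from datetime import date

# Hypothetical records to assess.
records = [
    {"id": 1, "trade_date": date(2019, 3, 1), "price": 101.5, "prev_price": 100.0},
    {"id": 2, "trade_date": None, "price": -5.0, "prev_price": 99.0},
]

# Each rule is (description, predicate); a failed predicate is a DQ exception.
rules = [
    ("trade_date is populated", lambda r: r["trade_date"] is not None),        # item 1
    ("trade_date is a date", lambda r: isinstance(r["trade_date"], date)),     # item 2
    ("price is positive", lambda r: r["price"] > 0),                           # item 3, simple rule
    ("price moved < 10% day on day",                                           # items 3/4, cross-attribute
     lambda r: abs(r["price"] - r["prev_price"]) / r["prev_price"] < 0.10),
]

for record in records:
    failures = [desc for desc, check in rules if not check(record)]
    print(record["id"], failures or "passed all rules")
```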

Once you have assessed your data using the technical element of your framework, there may well be another "governance" angle too, e.g. a data quality forum, to assess the impact of non-compliant elements, prioritise them, track remediation, publish details, etc.

richaux