Reconciliation of low quality data: view vs (table + scheduled procedures)

Question

TL;DR: How to build a consistent, proper view from bad, moving data that is reconciliated according to moving rules?

Hello all :) I'm building a database where data has to be transformed, reconciliated when possible and enriched heavily.

(btw, If you know a good book on this general topic, please post!)

In my particular case, the raw data comes from two tables.

Here is what happens on the raw tables:

Unformatted data (to be formatted: phone numbers, emails to lower case, etc.)
Missing data: Some fields are missing and could be enriched in the future
Updated data: Some lines are updated
New data: New lines are inserted

The reconciliation part uses fields (IDs, postal addresses, etc.) that can be partial, missing in the table or not found on the other table. Several reconciliation rules are used. All can be expressed in SQL. Some of them are using GROUP BYs (which prevent a view from being updatable).

The constraints are:

Underlying tables can be updated. The new data has to follow suit at most 72h later.
The new data has to be in queryable form (table or view or materialized view), with acceptable performance.
The lines in the new data must be somewhat consistent over time, so that one can annotate them (to enrich them, to mark them as having been sent elsewhere, etc.)
Some lines in the new data will have to be enriched manually.
Lines in the new data have to contain the time a which they were reconciliated (as well as time of update if they were updated)
The method of reconciliation can be updated. The new data has to follow suit at most 72h later.

I can't decide between using a view, or doing it with a table that is updated with a stored procedure.

A view would handle well the changes in raw data and updates in the reconciliation rules. But it would not support the annotation of the new lines.

A table updated with a stored procedure would handle that well, but would necessitate some complex handling when the reconciliation rules are changed, or when the raw data is updated.

I was thinking maybe using a view and have a table on the side, with the table's primary key being a hash of some stable fields of the new data.

The tools are: Oracle 10g (plus Java, if needed)

Sorry for the wall of text.

The question is : what would you do?

score 0 · Answer 1 · answered Feb 15 '13 at 13:41

0

Insert all the data from your raw tables inside 2 or more other tables. The new tables are properly designed in terms of relational theory, that will allow efficient querying on them.

answered Feb 15 '13 at 13:41

Plouf

627
3
7

Thanks. The problem of efficiency is not my primary concern here. What concerns me most is the quality of data that is propagated through rules that can change over time. – BenoitParis Feb 15 '13 at 13:46

Reconciliation of low quality data: view vs (table + scheduled procedures)

1 Answers1