I would like some ideas on the following problem.
The System
In a customer-centric system, the Customer_id
is a right-aligned 9-digit code with 7 leading spaces.
A valid entry would be ' 000000399
'.
There is a master CUSTOMERS
table and nearly 80 detail tables.
The master table CUSTOMERS
has multiple records for each Customer_id
and uses columns Start_date
, End_date
for record history.
The End_date is initially '9999
' and is updated to the Start_date
of a new record whenever a new record for an existing Customer_id
is inserted.
An example follows. The first record for Customer_id ' 000000399
' would be:
+--------------------+---------------------+---------------------+ | Cust_id | Start_date | End_date | +--------------------+---------------------+---------------------+ | ' 000000399' | '20100225210300000' | '9999 ' | +--------------------+---------------------+---------------------+
Lets assume that a new record in inserted at 2010-03-03 12:37:33,000
The new record should have an End_date of '9999' and the previous record's End_date should be updated with the new Start_date.
The records of this Customer_id should be:
+--------------------+---------------------+---------------------+ | Cust_id | Start_date | End_date | +--------------------+---------------------+---------------------+ | ' 000000399' | '20100225210300000' | '20100303123733000' | | ' 000000399' | '20100303123733000' | '9999 ' | +--------------------+---------------------+---------------------+
This pattern is followed in the master and the detail tables that use a Start_date and an End_date columns.
The Problem
At some point in time Customer_ids
started being inserted in wrong formats. You can find the following formats for the same Customer.
' 000000399'
'000000399 '
' 000000399 '
The problem is that these Customer_ids
are treated as different Customers from the system, so they maintain their own chains of Start_date
and End_date
.
We end up having the following records for the customer (sorted by Start_date
).
+---------+-------------+-------------------+-------------------+-------------------+
| Cust_id | New_Cust_id | Start_date | End_date | New_End_date |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20100225210300000 | 20100303123733000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20100303123733000 | 20110127001826000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20110127001826000 | 20110129001706000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20110129001706000 | 20120309001920000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20120309001920000 | 20120406001926000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20120406001926000 | 20120524001910000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20120524001910000 | 20130129180350000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20130129180350000 | 20131212153024000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20131212153024000 | 20150424184238000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20150424184238000 | 20160222184547000 | 20151103095530339 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | 399 | 20151103095530339 | 20180714105242000 | 20160222184547000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20160222184547000 | 20180911143426000 | 20180714105242000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | 399 | 20180714105242000 | 9999 | 20180911143426000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20180911143426000 | 9999 | |
+---------+-------------+-------------------+-------------------+-------------------+
The Challenge
We face the challenge of cleaning the data so that there are no Customer_ids deviations and that current data keep their integrity and consistency.
In the above example the solution is to inject the records with the faulty Customer_ids
in the correct position (Start_date/End_date
) and with the correct Customer_id
format.
Lets add some new columns for the columns that need to get updated.
+---------+-------------+-------------------+-------------------+-------------------+
| Cust_id | New_Cust_id | Start_date | End_date | New_End_date |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20100225210300000 | 20100303123733000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20100303123733000 | 20110127001826000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20110127001826000 | 20110129001706000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20110129001706000 | 20120309001920000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20120309001920000 | 20120406001926000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20120406001926000 | 20120524001910000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20120524001910000 | 20130129180350000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20130129180350000 | 20131212153024000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20131212153024000 | 20150424184238000 | |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20150424184238000 | 20160222184547000 | 20151103095530339 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | 399 | 20151103095530339 | 20180714105242000 | 20160222184547000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20160222184547000 | 20180911143426000 | 20180714105242000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | 399 | 20180714105242000 | 9999 | 20180911143426000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399 | | 20180911143426000 | 9999 | |
+---------+-------------+-------------------+-------------------+-------------------+
An image counts for a thousand words, so this is what it should look like: records injected in between
Simply put, whenever the following record has a different Customer_id format, the End date of the previous record must be updated to the value of the inserted Start_date. Of course the faulty Customer_id must also be updated to the right format (right-aligned).
So the final state of the previous data should look like: records updated as needed Updated data shows in Green color.
The solution
I am looking for a way to clean the data. I have some initial thoughts but I would like to hear some suggestions.
The database is minimally indexed, most tables are heaps, there are no referential keys.
Some numbers:
- 32M Customer records
- 4M unique Customer_ids
- 150K faulty records (wrong Customer_id format)
- 80 tables with faulty records
- 132M records is the biggest table
The limitations - Minimal downtime - New records are coming to the database (hopefully with the correct format)
I am currently working on writing a procedure that will do the following:
- Select each distinct
Customer_id
from theCUSTOMERS
table that has faulty records. Start a new Transaction
a. Create a temporary Table with all the corresponding records from table
CUSTOMERS
for thisCustomer_id
(in all formats)- i. Sort these records on the temporary table according to the Start_date
- ii. Update the End_date whenever required
- iii. Update the faulty Customer_id
- iv. Update the altered records on the
CUSTOMERS
table - by deleting all related records and inserting the updated ones from the temporary table, or- by updating the existing related records
b. Continue to the next (detail) table and create a new temporary table i. Repeat steps i-iv
- Commit the transaction when the records for the current
Customer_id
are updated in all 80 detail tables. - Loop to the next
Customer_id
and repeat steps 1-3 - Pray that everything will work according to plan…
Please share your thoughts.
PS. Sorry for the lengthy post :)