0

I would like some ideas on the following problem.

The System

In a customer-centric system, the Customer_id is a right-aligned 9-digit code with 7 leading spaces. A valid entry would be ' 000000399'.

There is a master CUSTOMERS table and nearly 80 detail tables. The master table CUSTOMERS has multiple records for each Customer_id and uses columns Start_date, End_date for record history. The End_date is initially '9999 ' and is updated to the Start_date of a new record whenever a new record for an existing Customer_id is inserted.

An example follows. The first record for Customer_id ' 000000399' would be:

+--------------------+---------------------+---------------------+ | Cust_id | Start_date | End_date | +--------------------+---------------------+---------------------+ | ' 000000399' | '20100225210300000' | '9999 ' | +--------------------+---------------------+---------------------+

Lets assume that a new record in inserted at 2010-03-03 12:37:33,000 The new record should have an End_date of '9999' and the previous record's End_date should be updated with the new Start_date. The records of this Customer_id should be:

+--------------------+---------------------+---------------------+ | Cust_id | Start_date | End_date | +--------------------+---------------------+---------------------+ | ' 000000399' | '20100225210300000' | '20100303123733000' | | ' 000000399' | '20100303123733000' | '9999 ' | +--------------------+---------------------+---------------------+

This pattern is followed in the master and the detail tables that use a Start_date and an End_date columns.

The Problem At some point in time Customer_ids started being inserted in wrong formats. You can find the following formats for the same Customer.

'       000000399'

'000000399       '

' 000000399      '

The problem is that these Customer_ids are treated as different Customers from the system, so they maintain their own chains of Start_date and End_date. We end up having the following records for the customer (sorted by Start_date).

+---------+-------------+-------------------+-------------------+-------------------+
| Cust_id | New_Cust_id | Start_date        | End_date          | New_End_date      |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20100225210300000 | 20100303123733000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20100303123733000 | 20110127001826000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20110127001826000 | 20110129001706000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20110129001706000 | 20120309001920000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20120309001920000 | 20120406001926000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20120406001926000 | 20120524001910000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20120524001910000 | 20130129180350000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20130129180350000 | 20131212153024000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20131212153024000 | 20150424184238000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20150424184238000 | 20160222184547000 | 20151103095530339 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     | 399         | 20151103095530339 | 20180714105242000 | 20160222184547000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20160222184547000 | 20180911143426000 | 20180714105242000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     | 399         | 20180714105242000 | 9999              | 20180911143426000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20180911143426000 | 9999              |                   |
+---------+-------------+-------------------+-------------------+-------------------+

The Challenge

We face the challenge of cleaning the data so that there are no Customer_ids deviations and that current data keep their integrity and consistency. In the above example the solution is to inject the records with the faulty Customer_ids in the correct position (Start_date/End_date) and with the correct Customer_id format. Lets add some new columns for the columns that need to get updated.

+---------+-------------+-------------------+-------------------+-------------------+
| Cust_id | New_Cust_id | Start_date        | End_date          | New_End_date      |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20100225210300000 | 20100303123733000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20100303123733000 | 20110127001826000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20110127001826000 | 20110129001706000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20110129001706000 | 20120309001920000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20120309001920000 | 20120406001926000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20120406001926000 | 20120524001910000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20120524001910000 | 20130129180350000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20130129180350000 | 20131212153024000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20131212153024000 | 20150424184238000 |                   |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20150424184238000 | 20160222184547000 | 20151103095530339 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     | 399         | 20151103095530339 | 20180714105242000 | 20160222184547000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20160222184547000 | 20180911143426000 | 20180714105242000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     | 399         | 20180714105242000 | 9999              | 20180911143426000 |
+---------+-------------+-------------------+-------------------+-------------------+
| 399     |             | 20180911143426000 | 9999              |                   |
+---------+-------------+-------------------+-------------------+-------------------+   

An image counts for a thousand words, so this is what it should look like: records injected in between

Simply put, whenever the following record has a different Customer_id format, the End date of the previous record must be updated to the value of the inserted Start_date. Of course the faulty Customer_id must also be updated to the right format (right-aligned).

So the final state of the previous data should look like: records updated as needed Updated data shows in Green color.

The solution

I am looking for a way to clean the data. I have some initial thoughts but I would like to hear some suggestions. The database is minimally indexed, most tables are heaps, there are no referential keys. Some numbers: - 32M Customer records - 4M unique Customer_ids - 150K faulty records (wrong Customer_id format) - 80 tables with faulty records - 132M records is the biggest table

The limitations - Minimal downtime - New records are coming to the database (hopefully with the correct format)

I am currently working on writing a procedure that will do the following:

  1. Select each distinct Customer_id from the CUSTOMERS table that has faulty records.
  2. Start a new Transaction

    a. Create a temporary Table with all the corresponding records from table CUSTOMERS for this Customer_id (in all formats)

    • i. Sort these records on the temporary table according to the Start_date
    • ii. Update the End_date whenever required
    • iii. Update the faulty Customer_id
    • iv. Update the altered records on the CUSTOMERS table - by deleting all related records and inserting the updated ones from the temporary table, or- by updating the existing related records

    b. Continue to the next (detail) table and create a new temporary table i. Repeat steps i-iv

  3. Commit the transaction when the records for the current Customer_id are updated in all 80 detail tables.
  4. Loop to the next Customer_id and repeat steps 1-3
  5. Pray that everything will work according to plan…

Please share your thoughts.

PS. Sorry for the lengthy post :)

  • 1
    Please format your question. As it stand it is unreadable. Do not use images with tables inside them, use text to represent data. Use [this](https://ozh.github.io/ascii-tables/) to format tables. – Andrea Oct 22 '18 at 09:40
  • 1
    Try to supply only critical info and ask short and specific questions. As it stands now it's: "I have a big problem with a lot of data, please fix it." I can fix these kind of things, however this is not to place for me to place invoices and such. – ppijnenburg Oct 22 '18 at 09:46
  • Thank you for the comments. I will try to rephrase the question. – Konstantinos Katsoridis Oct 22 '18 at 10:31
  • Please don't post sample data as ascii tables. Sample data is best served as [DDL](https://en.wikipedia.org/wiki/Data_definition_language) + [DML](https://en.wikipedia.org/wiki/Data_manipulation_language). Please [edit] your question to include it, your current attempt and your desired results. For more details, [read this.](https://dba.meta.stackexchange.com/questions/2976/help-me-write-this-query-in-sql) – Zohar Peled Oct 22 '18 at 10:34

0 Answers0