DQS How to cleanse duplicates in master table and correct IDs in all referenced tables

Question

We have lots of duplicates in a master table. See the below example, Customer ABC Corp exists 3 times in the master table and all 3 CustIDs are referenced in Orders table.

**Customers Table**
<table>
<tr><td><strong>CustID</strong></td><td><strong>CustName</strong></td></tr>
<tr><td>1001</td><td>ABC Corp.</td></tr>  
<tr><td>1002</td><td>XYZ Corp.</td></tr>  
<tr><td>1003</td><td>ABC Corp Ltd.</td></tr>  
<tr><td>1002</td><td>ABC Corporation Limited.</td></tr>  
</table>


**Orders Table**
<table>
<tr><td><strong>OrderID</strong></td><td><strong>CustID</strong></td></tr>
<tr><td>23425</td><td>1001</td></tr>  
<tr><td>23466</td><td>1003</td></tr>  
<tr><td>23488</td><td>1003</td></tr>  
<tr><td>43877</td><td>1004</td></tr>  
</table>

How can I use Data Quality Services to Remove duplicate Customers from Customers table and ALSO to update Orders table to reflect the changes.

i.e. CustIDs 1003 and 1004 merged into 1001 and 1003 and 1004 from Orders table need to be updated to 1001 too.

What I have done so far. Created a knowledge base for Customers with Synonyms for CustName and defined a leading value. And then created data cleansing project in DQS and analysed the data and then made corrections to master data and imported those corrected values to the Customers Knowledge base. I understood that DQS Cleansing Transformation in SSIS does this automatically by connecting to the source table and DQS KB and gives matching and non matching rows, But I still don't understand what to do from here. How the data in Customers table corrected and Orders table updated accordingly. Please suggest. The examples I have seen so far are using pretty basic stuff with just a table and a simple excel sheet.

As I understand you have solved your problem... Do you want to know how DQS works? — Eralper, Mar 07 '16 at 07:53
The foreign key definition : Customers.CustID (Primary key) = Orders.CustID(foreign key). — MSBI-Geek, Mar 07 '16 at 08:48
@Eralper sorry I haven't found the solution. I've created a knowledge base and data cleansing project. But how do I use this to correct data in the base SQL tables Customers and Orders? — MSBI-Geek, Mar 07 '16 at 08:50

score 1 · Answer 1 · answered Mar 07 '16 at 17:45

1

Geek,

When you run a SQL Server Data Quality Service project for cleansing as the last step the wizard requests you how to export the results.

If you choose the same source table as the target, it will replace the original data with new data after cleansing

But I think it is best option to export data into a temp table, then execute a SQL query to update validated changes

I had a CountryList table. Then by running a DQS project for data cleansing purpose and exported the results (data with cleansing info) to a new table CountryListDQS in the SQL Server database

Later, by running a SQL Update statement similar to below you can update your original data with validated changes from DQS

update dbo.CountryList
set
    country = Q.country_Output
from dbo.CountryList C
inner join dbo.CountryListDQS Q
on C.id = Q.id_Output
where Q.country_Status = 'Corrected'

answered Mar 07 '16 at 17:45

Eralper

6,461
2
21
27

Thanks, But you are only updating CountryList with Corrected values, what about the duplicates? How do you remove duplicate countries say for example you have UK, UnitedKingdom, United Kingdom, Only one should remain (UK) and others removed. The Cleansing project does not give this information. Also when you keep UK and remove other values what happens to the transaction tables that are referring UnitedKingdom and United Kingdom. I hope you see my point. – MSBI-Geek Mar 07 '16 at 23:58
Unfortunately if you mean to delete rows among we have matched to delete duplicates, it does not provide an automatic way. But if you have a lookup table where you executed cleansing mapping a number of rows to a unique one, than you can delete manually the remaining duplicates. I mean let's say when you have CountryId and CountryName fields with many duplicates, you will end up with Id's 3,5,7 with UK on name field after cleansing. Then you can update all referred tables with that foreign key to the one you selected to be unique. Then delete all others from Country lookup table manually. – Eralper Mar 08 '16 at 16:23

score 0 · Answer 2 · answered Mar 08 '16 at 17:03

After cleansing and matched list of data rows are stored in a new database table, you can first update referring table fields with new value (for duplicates of that record), then delete the duplicates in the lookup table.

Unfortunately, I don't know if DQS can do this for us. But following queries may help you.

Following one updates a referring table to CountryId field. I preferred to keep the row with minimum Id value as original and all others as duplicates.

;with cte as (
select
    NewCountryId = Min(id_Output) Over (Partition By Country_Output),
    *
from dbo.CountryListDQS
where country_Status = 'Corrected'
)
Update CityList
Set
    CountryId = cte.NewCountryId
From CityList
Inner Join cte on cte.id_Output = CityList.CountryId
where CityList.CountryId <> cte.NewCountryId

You can use SQL aggregate functions with Partition By clause like Sum(), Count() or Min(), etc.

And below Delete command removes duplicate rows in the lookup table. Perhaps it is better to set inactive using a flag field instead of deleting the rows from the table

with cte as (
select
    rn = ROW_NUMBER() Over (Partition By Country_Output Order By id_Output),
    *
from dbo.CountryListDQS
where country_Status = 'Corrected'
)
delete #CountryListTmp
from #CountryListTmp
inner join cte on #CountryListTmp.id = cte.id_Output
where rn > 1

Or best keep a copy of the lookup table. In case the data integrity problems occur due to orphan records for CountryId missing foreign key relation, you can still find the original record in the backup table.

You can further check for how to delete duplicate rows using SQL Row_Number() function as an other option.

As last point, I'm not sure Data Quality Services does deleting duplicate rows and map all other referring table rows to original value.

Hi Eralper, Thanks for the answer. I now understand DQS has some limitations as far as my requirements are concerned. I need to do lot of manual stuff. I'll present this to my manager and we will see to use DQS or not. — MSBI-Geek, Mar 14 '16 at 15:47
Perhaps you have already checked the TechNet Virtual Labs. At https://technet.microsoft.com/en-us/virtuallabs you can search for Master Data Services labs among SQL Server labs. Perhaps you can find something useful to take a decision — Eralper, Mar 15 '16 at 08:30
Yes I have gone through them, they mainly say DQS and MDS are used together. But we don't use MDS. — MSBI-Geek, Mar 15 '16 at 23:20

DQS How to cleanse duplicates in master table and correct IDs in all referenced tables

2 Answers2