3

I have a data set that has some column where values match, but the rest of the column values do not. I need to delete duplicates where SubCategory of a lower level (Level2, Level3 and Level 4) "IS NOT NULL" but its corresponding "duplicate partner" (grouped by [SubCategory Level 1 ID], [Product Category] and [Product Name]) has the same lower level SubCategory - "IS NULL". Per table below I need to remove ID 2, 4, 6 and 9 (see highlighted in red font).

I've tried Dense_Rank, Rank and Row_Number functions with Partition By but that did not give me the disired output. Maybe I need to use their combination...

Eg.: RowID 1 and 2 are duplicates by [Product Category], [Product Name], [Category Level 1]. "Category Level 1" is just an ID of "Product Category". In need to remove RowID 2 because its corresponding duplicate partner RowID 1 has no "Category Level 3" assigned when RowID 2 has. Same logic applues to RowID 9 and 10, but at this time RowID 9 has "Category Level 2" where Row 10 does not. If both duplicates (RowID 1 and 2) would have "Category Level 3" assigned we would not need to delete any of them

Sample Dataset

IF OBJECT_ID('tempdb..#Category', 'U') IS NOT NULL
  DROP TABLE #Category;
GO

CREATE TABLE #Category 
(
    RowID INT NOT NULL,
    CategoryID INT NOT NULL,
    ProductCategory VARCHAR(100) NOT NULL,
    ProductName VARCHAR(100) NOT NULL,
    [SubCategory Level 1 ID] INT NOT NULL,
    [SubCategory Level 2 ID] INT NULL,
    [SubCategory Level 3 ID] INT NULL,  
    [SubCategory Level 4 ID] INT NULL
);

INSERT INTO #Category (RowID, CategoryID, ProductCategory, ProductName, [SubCategory Level 1 ID], [SubCategory Level 2 ID], [SubCategory Level 3 ID], [SubCategory Level 4 ID])
VALUES 
(1, 111, 'Furniture', 'Table', 200, 111, NULL, NULL),
(2, 234, 'Furniture', 'Table', 200, 234, 123, NULL),
(3, 122, 'Furniture', 'Chair', 200, 122, NULL, NULL),
(4, 122, 'Furniture', 'Chair', 200, 122, 32, NULL),
(5, 12, 'Auto', 'Trucks', 300, 766, 12, NULL),
(6, 3434, 'Auto', 'Trucks', 300, 322, 3434, 333),
(7, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(8, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(9, 300, 'Auto', 'Sedans', 300, 231, NULL, NULL),
(10, 300, 'Auto', 'Sedans', 300, NULL, NULL, NULL),
(11, 300, 'Auto', 'Cabriolet', 300, 456, 688, NULL),
(12, 300, 'Auto', 'Cabriolet', 300, 456, 976, NULL),
(13, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334),
(14, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334);


SELECT * FROM #Category; 
-- ADD YOU CODE HERE TO RETURN the following RowIDs: 2, 4, 6, 9
Data Engineer
  • 795
  • 16
  • 41
  • please show your existing query – Squirrel May 18 '19 at 06:38
  • Your requirement is hard to understand. Can you explain why 2, 4, 6 and 9 should be removed? – forpas May 18 '19 at 18:28
  • Hi Forpas, they should be removed for the same exact reason. Eg : RowID 1 and 2 are duplicates by [Product Category], [Product Name], [Category Level 1]. "Category level 1" is just an ID of "Product Category". In need to remove RowID 2 because it's corresponding partner RowID 1 has no "Category Level 3" assigned when RowID 2 has. Same logic for RowID 9 and 10, but at this time RowID 9 has "Category Level 2" where Row 10 does not. If both duplicates (RowID 1 and 2) would have "Category Level 3" assigned we would not need to delete any of them. – Data Engineer May 18 '19 at 20:10

2 Answers2

1

If I understand this right, your logic is the following:

For each unique SubCategory Level 1, Product Category, and Product Name combination, you want to return the row which has the least amount of filled in SubCategory level data.

Using a quick dense_rank with partitions on the relevant fields, you can order the rows with less Sub Categories levels to be set to 1. Rows 2, 4, 6, and 9 should now be the only rows returned.

;with DataToSelect
as
(
    SELECT *,
           DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID] 
                    ORDER BY
                        CASE 
                            WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                            WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                            WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                            END) as [ToInclude]
    FROM #Category
)
SELECT *
FROM 
    DataToSelect
WHERE 
    ToInclude != 1
ORDER BY 
    RowID

Keep in mind if you have two rows with the same SubCategory level per SubCategory Level 1, Product Category, and Product Name combination, they'll both be included. If you do not want this, just swap the dense_rank to row_number and add some alternative criteria on which should be selected first.

Data Engineer
  • 795
  • 16
  • 41
Matt
  • 2,851
  • 1
  • 13
  • 27
  • Hi Matt, I tried this approach already, It's not what I'm trying to achive. As I said I need Rows 2, 4, 6, and 9 only. The logic si to remove all duplicate by (Category 1, Category Name and Product Name) records where lower level category is NOT NULL in one or more duplicates but where one of the Duplicate rows has not the same lower level category assigned (IS NULL). So I would keep only one row with higher lever category assigned and without lower level (see screenshot above. it has the same exact pattern) – Data Engineer May 18 '19 at 15:33
  • You query returns RowIDs - 1, 3, 6, 7, 8 and 10 – Data Engineer May 18 '19 at 15:34
  • OK, your question was quite unclear. If you only want rows `2`,`4`,`6`,`9`, then I've just swapped the where clause to `!=` instead. Why do you want to receive the duplicate rows and not the non-duplicate rows? – Matt May 18 '19 at 21:13
  • Matt, I did update my code with an example that answers your question. Also when switching to "ToInclude !=1" is returns RowID 2, 4, 5, and 9. We should not remove RowID 5. But do need to delete duplicate Row_ID 6 as it has "Category Level 4 assigned" (333). So, you can see that it works correctly only for Category Level 2 and "Level 3" not for "Category Level 4". – Data Engineer May 18 '19 at 22:11
  • Matt, you were very close. We also need to twist the CASE Statement as follows WHEN [SubCategory Level 4 ID] IS not NULL THEN 3 WHEN [SubCategory Level 3 ID] IS not NULL THEN 2 WHEN [SubCategory Level 2 ID] IS not NULL THEN 1 I updated your code. Thanks a lot! – Data Engineer May 18 '19 at 22:33
1

this thread helped me a lot with understanding a different method to removing duplicate date. I want to thank the original contributors. I did however notice that the final solution is incomplete. The original poster wanted the results to return RowId's 2,4,6,9 however the ToInclude != 1 filter doesnt allow that. I am adding the code to complete the query by adding a where > 1 filter which will produce the intended result. See the code below:

;with DataToSelect
as
(
    SELECT *,
           DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID] 
                    ORDER BY
                        CASE 
                            WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                            WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                            WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                            END) as [ToInclude]
    FROM #Category
)
SELECT *
FROM 
    DataToSelect
WHERE 
    ToInclude > 1
ORDER BY 
    RowID

This returns:

Results Table of Code

Chadwick
  • 11
  • 3