Duplicates on Self Left Join

Question

I'm trying to pivot out a table of data stored in a vertical model into a more horizontal, SQL Server table-like model. Unfortunately due to the nature of the data, I cannot use the real data here so I worked up a generic example that follows the same model.

There are three columns to the table, an ID, column ID and value, where the ID and column ID form the Primary Key. Additionally none of the data is required (i.e. an ID can be missing column ID = 3 without breaking anything)

PetID | ColumnID | Value
---------------------------
1     | 1        | Gilda
1     | 2        | Cat
2     | 1        | Sonny
2     | 2        | Cat
2     | 3        | Black

Due to the fact that the Primary Key is a composite of two columns I cannot use the built in PIVOT functionality, so I tried doing a self LEFT JOIN:

SELECT T1.PetID
    ,T2.Value AS [Name]
    ,T3.Value AS [Type]
    ,T4.Value AS [Color]
FROM @Temp AS T1
LEFT JOIN @Temp AS T2 ON T1.PetID = T2.PetID
    AND T2.ColumnID = 1
LEFT JOIN @Temp AS T3 ON T1.PetID = T3.PetID
    AND T3.ColumnID = 2
LEFT JOIN @Temp AS T4 ON T1.PetID = T4.PetID
    AND T4.ColumnID = 3;

The idea being that I want to take the ID from T1 and then do a self LEFT JOIN to get each of the values by ColumnID. However I'm getting duplicates in the data:

PetID | Name  | Type | Color
------------------------------
1     | Gilda | Cat  | NULL
1     | Gilda | Cat  | NULL
2     | Sonny | Cat  | Black
2     | Sonny | Cat  | Black
2     | Sonny | Cat  | Black

I am able to get rid of these duplicates using a DISTINCT, but the dataset is rather large, so the required sort action is slowing down the query tremendously. Is there a better way to accomplish this or am I just stuck with a slow query?

There's quite possibly a better way to do the joining part, but I do believe GROUP BY is faster than DISTINCT, and you can use that to remove duplicates. — GendoIkari, Jan 11 '16 at 18:56
@GendoIkari, true but the GROUP BY will still use a sort in the background and slow down the query. I'm really looking for a more efficient way to do the joins, but thank you for the suggestion. — Steven Ball, Jan 11 '16 at 19:01
Join is going to bring in more rows but you have some interesting answers to get around joins — paparazzo, Jan 11 '16 at 19:24
Are you really using a table variable? If so try with #temp. You can get terrible execution plans with table variables. Table variable are often not good many rows. — paparazzo, Jan 11 '16 at 19:30
@Frisbee, the real data uses a table in SQL Server. The example is just using a table variable for simplicity sake. — Steven Ball, Jan 11 '16 at 19:33

score 1 · Accepted Answer · answered Jan 11 '16 at 19:09

1

You can use a CASE statement and avoid the joins altogether.

SELECT
    PetID,
    MAX(CASE WHEN ColumnID = 1 THEN Value ELSE NULL END) AS Name,
    MAX(CASE WHEN ColumnID = 2 THEN Value ELSE NULL END) AS Type,
    MAX(CASE WHEN ColumnID = 3 THEN Value ELSE NULL END) AS Color
FROM @Temp
GROUP BY PetId

It is essential that PetID, ColumnID be your primary key for this to work correctly. Otherwise it will cause problems when the same ColumnID is used multiple times for the same PetID

answered Jan 11 '16 at 19:09

Munir

3,442
3
19
29

I will give this a try. Thank you for the suggestion! – Steven Ball Jan 11 '16 at 19:21
I tried a couple of answers provided and this one provided the fastest execution time in the real data. – Steven Ball Jan 11 '16 at 22:12

JamieD77 · Answer 2 · 2016-01-11T19:22:28.170

1

You can use pivot if you'd like to..

SELECT  *
FROM    (SELECT PetID,
                (CASE ColumnID
                   WHEN 1 THEN 'Name'
                   WHEN 2 THEN 'Type'
                   WHEN 3 THEN 'Color'
                 END) ValueType,
                VALUE
         FROM   @Temp
        ) t 
PIVOT
(   MAX(Value) 
    FOR ValueType IN ([Name],[Type],[Color]) 
) p

Another way without the Sub query would be..

SELECT  PetID,
        [1] [Name],
        [2] [Type],
        [3] [Color]
FROM    @Temp
PIVOT
(   MAX(Value) 
    FOR ColumnID IN ([1],[2],[3]) 
) p

edited Jan 11 '16 at 19:22

answered Jan 11 '16 at 19:14

JamieD77

13,796
1
17
27

Very interesting! I will give this a try. Thank you for the suggestion. – Steven Ball Jan 11 '16 at 19:21
I tried this with the real data and another answer ended up being faster, but this is good information to use in future queries. – Steven Ball Jan 11 '16 at 22:15

shawnt00 · Answer 3 · 2016-01-11T19:21:11.913

0

I don't understand your concern about sorting. You have a primary key so you also have an index. This is the correct way to do it:

select
    PetID,
    min(case when ColumnID = 1 then Value end) as Name,
    min(case when ColumnID = 2 then Value end) as Type,
    min(case when ColumnID = 3 then Value end) as Color
from @Temp
group by PetID

A fix for your duplication is simple though and will probably improve performance as well:

FROM (select distinct PetID from @Temp) AS T1

edited Jan 11 '16 at 19:21

answered Jan 11 '16 at 19:09

shawnt00

16,443
3
17
22

The problem with sorting is its a costly action for SQL Server and slows down query execution, particularly for large datasets. My example was just a small sample, as my full dataset is very large. It is generally considered best practice to refrain from sorting unless absolutely needed. – Steven Ball Jan 11 '16 at 19:14
Your data is already sorted because you have defined a primary key. How is a four-way join going to be faster than a single scan anyway? – shawnt00 Jan 11 '16 at 19:16

score 0 · Answer 4 · answered Jan 11 '16 at 19:46

SELECT T1.PetID
    ,T1.Value AS [Name]
    ,T2.Value AS [Type]
    ,T3.Value AS [Color]
    --select *
FROM #Temp AS T1
LEFT JOIN #Temp AS T2 ON T1.PetID = T2.PetID
    AND T2.ColumnID = 2
LEFT JOIN #Temp AS T3 ON T1.PetID = T3.PetID
    AND T3.ColumnID = 3
where t1.ColumnID = 1

Your problem was that you were joining to the main table that had multiple rows.

Duplicates on Self Left Join

4 Answers4