SELECT DISTINCT for data groups

Question

I have following table:

ID  Data
1   A
2   A
2   B
3   A
3   B
4   C
5   D 
6   A
6   B

etc. In other words, I have groups of data per ID. You will notice that the data group (A, B) occurs multiple times. I want a query that can identify the distinct data groups and number them, such as:

DataID     Data
101        A
102        A
102        B
103        C
104        D

So DataID 102 would resemble data (A,B), DataID 103 would resemble data (C), etc. In order to be able to rewrite my original table in this form:

ID   DataID
1    101
2    102
3    102
4    103
5    104
6    102

How can I do that?

PS. Code to generate the first table:

CREATE TABLE #t1 (id INT, data VARCHAR(10))
INSERT INTO #t1
SELECT 1, 'A'
UNION ALL SELECT 2, 'A'
UNION ALL SELECT 2, 'B'
UNION ALL SELECT 3, 'A'
UNION ALL SELECT 3, 'B'
UNION ALL SELECT 4, 'C'
UNION ALL SELECT 5, 'D'
UNION ALL SELECT 6, 'A'
UNION ALL SELECT 6, 'B'

@littlegreen - what is the basis of your row 3 2 in your expected output? — Sachin Shanbhag, Nov 25 '10 at 10:27
I'm struggling to see what you mean/require. How are ID and DataID related (if at all)? Any chance you could rework the example such that there are no values re-used between tables unless they refer to the same thing? — John Pickup, Nov 25 '10 at 10:29
Because ID 3 has data (A, B) in my original table, and that data combination has received DataID 2. — littlegreen, Nov 25 '10 at 10:30
@JohnPickup: Sure, I renumbered DataID. Hope it's clearer now. — littlegreen, Nov 25 '10 at 10:32
@littlegreen - are the values in your `data` column a limited set of fixed values that you know before hand? — Jagmag, Nov 25 '10 at 10:44
@InSane: yes, there are about 1000 of those values and I have already assigned them an ID. I just presented them here as 'A' and 'B' etc for clarity. — littlegreen, Nov 25 '10 at 10:46
@PDBA: I don't agree. Basically, luckyluke has given the answer to the problem. I am rewarding him the points. — littlegreen, Dec 06 '10 at 14:50

luckyluke · Accepted Answer · 2010-11-25T10:41:53.063

3

In my opinion You have to create a custom aggregate that concatenates data (in case of strings CLR approach is recommended for perf reasons). Then I would group by ID and select distinct from the grouping, adding a row_number()function or add a dense_rank() your choice. Anyway it should look like this

with groupings as (
select concat(data) groups
from Table1
group by ID
)
select groups, rownumber() over () from groupings

edited Nov 25 '10 at 10:41

answered Nov 25 '10 at 10:29

luckyluke

1,553
9
16

That's just what I was hoping to avoid... (the concat) – littlegreen Nov 25 '10 at 10:44
Unfortunately that is probably the solution. How can You dinstiguish different data groups in the other way than creating some king of aggregate (summary for each ID) and comparing it with others. Why the aggregate solution is not viable for You. Are the strings long, or is the data some kind of BitMap oor file? – luckyluke Nov 25 '10 at 10:49
The strings are quite long. I have assigned them an integer ID, but to concatenate these integers into a string and distinct them for this purpose seems like an ugly hack to me. It could be the only solution yes, but I wanted to check here before implementing it. – littlegreen Nov 25 '10 at 10:50
The aggregate needs not to be a 'dirty' string concatenation It can be an a function with a property turning a collection of integers into a unique value (of whatever type). It has something to do with lossless compression or hashing. – luckyluke Nov 25 '10 at 11:00
That is true. I don't know of such a function, but I can try to find it. – littlegreen Nov 25 '10 at 11:10

score 2 · Answer 2 · answered Nov 25 '10 at 11:04

The following query using CASE will give you the result shown below.

From there on, getting the distinct datagroups and proceeding further should not really be a problem.

SELECT     
    id, 
     MAX(CASE data WHEN 'A' THEN data ELSE '' END) + 
     MAX(CASE data WHEN 'B' THEN data ELSE '' END) + 
     MAX(CASE data WHEN 'C' THEN data ELSE '' END) + 
     MAX(CASE data WHEN 'D' THEN data ELSE '' END) AS DataGroups
FROM  t1
GROUP BY id

ID  DataGroups
1   A
2   AB
3   AB
4   C
5   D
6   AB

However, this kind of logic will only work in case you the "Data" values are both fixed and known before hand.

In your case, you do say that is the case. However, considering that you also say that they are 1000 of them, this will be frankly, a ridiculous looking query for sure :-)

LuckyLuke's suggestion above would, frankly, be the more generic way and probably saner way to go about implementing the solution though in your case.

score 0 · Answer 3 · answered Nov 25 '10 at 11:11

From your sample data (having added the missing 2,'A' tuple, the following gives the renumbered (and uniqueified) data:

with NonDups as (
select t1.id
from #t1 t1 left join #t1 t2
on t1.id > t2.id and t1.data = t2.data
group by t1.id
having COUNT(t1.data) > COUNT(t2.data)
), DataAddedBack as (
    select ID,data
    from #t1 where id in (select id from NonDups)
), Renumbered as (
    select DENSE_RANK() OVER (ORDER BY id) as ID,Data from DataAddedBack
)
select * from Renumbered

Giving:

1          A
2          A
2          B
3          C
4          D

I think then, it's a matter of relational division to match up rows from this output with the rows in the original table.

score 0 · Answer 4 · answered Nov 25 '10 at 11:20

Just to share my own dirty solution that I'm using for the moment:

SELECT DISTINCT t1.id, D.data
FROM #t1 t1
CROSS APPLY ( 
    SELECT CAST(Data AS VARCHAR) + ','
    FROM #t1 t2
    WHERE t2.id = t1.id
    ORDER BY Data ASC
    FOR XML PATH('') )  
D ( Data )

And then going analog to LuckyLuke's solution.

SELECT DISTINCT for data groups

4 Answers4