
Hopefully I'm missing a simple solution to this.

I have two tables. One contains a list of companies. The second contains a list of publishers. The mapping between the two is many-to-many. What I would like to do is bundle or group all of the companies in table A which have any relationship to a publisher in table B, and vice versa.

The final result would look something like this (GROUPID is the key field). Rows 1 and 2 are in the same group because they share the same company. Row 3 is in the same group because publisher Y was already mapped to company A. Row 4 is in the group because company B was already mapped to group 1 through publisher Y.

Said simply, any time there is any kind of shared relationship across Company and Publisher, that pair should be assigned to the same group.

ROW   GROUPID     Company     Publisher
1     1           A           Y
2     1           A           X
3     1           B           Y
4     1           B           Z
5     2           C           W
6     2           C           P
7     2           D           W

Fiddle
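
In case the fiddle is unavailable, here is a minimal sketch of the setup assumed throughout (column types are a guess based on the sample data):

create table Table1 (
    GroupID int null,               -- to be populated
    Company varchar(1) not null,
    Publisher varchar(1) not null
)

insert into Table1 (Company, Publisher)
values ('A', 'Y'), ('A', 'X'),
       ('B', 'Y'), ('B', 'Z'),
       ('C', 'W'), ('C', 'P'),
       ('D', 'W')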

Update:
My bounty version: given the table in the fiddle above, containing simply Company and Publisher pairs, populate the GROUPID field. Think of it as creating a Family ID that encompasses all related parents/children.

SQL Server 2012

Roman Pekar
James Frost

6 Answers


I thought about using a recursive CTE, but as far as I know, SQL Server doesn't allow UNION to connect the anchor member and the recursive member of a recursive CTE (I think PostgreSQL allows it), so it's not possible to eliminate duplicates during the recursion.
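
For reference, this is the shape that is ruled out (a sketch only; SQL Server insists on UNION ALL between the anchor and the recursive member, so the UNION below is rejected at parse time):

-- NOT valid T-SQL: a recursive CTE may only use UNION ALL,
-- so duplicates cannot be eliminated inside the recursion itself
with walk as (
    select Company, Publisher from Table1
    union   -- SQL Server raises a syntax error here
    select t.Company, t.Publisher
    from walk as w
        join Table1 as t on t.Company = w.Company or t.Publisher = w.Publisher
)
select * from walk

The iterative approach below works around that restriction: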

declare @i int

-- seed each row with a unique GroupID
;with cte as (
     select
         GroupID,
         row_number() over(order by Company) as rn
     from Table1
)
update cte set GroupID = rn

select @i = @@rowcount

-- while some rows updated
while @i > 0
begin
    update T1 set
        GroupID = T2.GroupID
    from Table1 as T1
        inner join (
            select T2.Company, min(T2.GroupID) as GroupID
            from Table1 as T2
            group by T2.Company
        ) as T2 on T2.Company = T1.Company
    where T1.GroupID > T2.GroupID

    select @i = @@rowcount

    update T1 set
        GroupID = T2.GroupID
    from Table1 as T1
        inner join (
            select T2.Publisher, min(T2.GroupID) as GroupID
            from Table1 as T2
            group by T2.Publisher
        ) as T2 on T2.Publisher = T1.Publisher
    where T1.GroupID > T2.GroupID

    -- will be > 0 if any rows updated
    select @i = @i + @@rowcount
end

;with cte as (
     select
         GroupID,
         dense_rank() over(order by GroupID) as rn
     from Table1
)
update cte set GroupID = rn

sql fiddle demo
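
As a quick sanity check once the loop has converged (just a sketch against the same Table1): every Company and every Publisher should map to exactly one GroupID, so both halves of this query should return no rows:

-- companies or publishers spanning more than one group indicate a bug
select Company as Violator
from Table1
group by Company
having count(distinct GroupID) > 1
union all
select Publisher
from Table1
group by Publisher
having count(distinct GroupID) > 1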

I've also tried a breadth-first search algorithm. I thought it could be faster (it's better in terms of complexity), so I'll provide that solution here. I've found that it's not faster than the set-based SQL approach, though:

declare @Company nvarchar(2), @Publisher nvarchar(2), @GroupID int

declare @Queue table (
    Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
    primary key(Company, Publisher)
)

select @GroupID = 0

while 1 = 1
begin
    select top 1 @Company = Company, @Publisher = Publisher
    from Table1
    where GroupID is null

    if @@rowcount = 0 break

    select @GroupID = @GroupID + 1

    insert into @Queue(Company, Publisher)
    select @Company, @Publisher

    while 1 = 1
    begin
        select top 1 @Company = Company, @Publisher = Publisher
        from @Queue
        order by ID asc

        if @@rowcount = 0 break

        update Table1 set
            GroupID = @GroupID
        where Company = @Company and Publisher = @Publisher

        delete from @Queue where Company = @Company and Publisher = @Publisher

        ;with cte as (
            select Company, Publisher from Table1 where Company = @Company and GroupID is null
            union all
            select Company, Publisher from Table1 where Publisher = @Publisher and GroupID is null
        )
        insert into @Queue(Company, Publisher)
        select distinct c.Company, c.Publisher
        from cte as c
        where not exists (select * from @Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
    end
end

sql fiddle demo

I've tested my versions and Gordon Linoff's to check how they perform. The recursive CTE looks much worse; I couldn't wait for it to complete on more than 1000 rows.

Here's a sql fiddle demo with random data. My results were:

128 rows:
    my RBAR solution: 190ms
    my SQL solution: 27ms
    Gordon Linoff's solution: 958ms
256 rows:
    my RBAR solution: 560ms
    my SQL solution: 1226ms
    Gordon Linoff's solution: 45371ms

It's random data, so the results may not be very consistent. I think the timings could be changed by indexes, but I don't think that would change the whole picture.
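
If you do experiment with indexes, the obvious candidates (untested here, just a sketch; index names are illustrative) are plain indexes on the two join columns:

create index IX_Table1_Company on Table1 (Company)
create index IX_Table1_Publisher on Table1 (Publisher)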

Old version, using a temporary table, just calculating GroupID without touching the initial table:

declare @i int

-- creating table to gather all possible GroupID for each row
create table #Temp
(
    Company varchar(1), Publisher varchar(1), GroupID varchar(1), -- GroupID starts as a Company value, hence varchar
    primary key (Company, Publisher, GroupID)
)

-- initializing it with data
insert into #Temp (Company, Publisher, GroupID)
select Company, Publisher, Company
from Table1

select @i = @@rowcount

-- while some rows inserted into #Temp
while @i > 0
begin
    -- expand #Temp in both directions
    ;with cte as (
        select
            T2.Company, T1.Publisher,
            T1.GroupID as GroupID1, T2.GroupID as GroupID2
        from #Temp as T1
            inner join #Temp as T2 on T2.Company = T1.Company
        union
        select
            T1.Company, T2.Publisher,
            T1.GroupID as GroupID1, T2.GroupID as GroupID2
        from #Temp as T1
            inner join #Temp as T2 on T2.Publisher = T1.Publisher        
    ), cte2 as (
        select
            Company, Publisher,
            case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
        from cte
    )
    insert into #Temp
    select Company, Publisher, GroupID
    from cte2
    -- don't insert duplicates
    except
    select Company, Publisher, GroupID
    from #Temp

    -- will be > 0 if any row inserted
    select @i = @@rowcount
end

select
    Company, Publisher,
    dense_rank() over(order by min(GroupID)) as GroupID
from #Temp
group by Company, Publisher

=> sql fiddle example

Roman Pekar
  • I thought about using a CTE as well, but there are limitations in SQL Server on recursion depth; unsure whether the OP would encounter them or not. – UnhandledExcepSean Sep 06 '13 at 18:10
  • It's possible to use option (maxrecursion 0). The main problem is how to eliminate duplicates. – Roman Pekar Sep 06 '13 at 18:11
  • `OPTION (MAXRECURSION 0)` is my friend, that part doesn't bother me. – Hart CO Sep 06 '13 at 18:13
  • @RomanPekar . . . If you are interested, you can do this with a recursive CTE. See my answer below. – Gordon Linoff Sep 07 '13 at 15:39
  • @GordonLinoff check my updated answer; I thought about storing paths in strings, which is good as an exercise, but it'll perform very badly for long groups. – Roman Pekar Sep 07 '13 at 21:44
  • @RomanPekar . . . I don't want to get into performance wars. But your table structure (with a primary key on company and publisher) is optimized for your approach. Testing on SQL Fiddle, I'm seeing somewhat different timings after putting in an identity primary key (and hence not having an index on those fields). But kudos for putting the timings together. – Gordon Linoff Sep 07 '13 at 21:56
  • @GordonLinoff I'm trying to get the best indexes for the different approaches, but it's night here, so I don't have much time :) I like your approach too; pure SQL is always nice. I started testing for performance after it hadn't completed 1000 rows within a minute (my previous SQL solution was also not so good; I've fixed it and tried to create an RBAR solution). The OP mentioned ~5 million rows, and I think that calls for a linear algorithm. – Roman Pekar Sep 07 '13 at 22:02
  • In what circumstance do you imagine your 2nd approach (as it appears in the answer) would outperform your first? – Hart CO Sep 08 '13 at 13:26
  • I think it needs testing (and carefully chosen indexes), but if my first approach has to do many updates (or your data is such that you end up with a small number of large groups), linear RBAR updates could outperform it. Something like 5000000 rows and 1-2-3 groups. – Roman Pekar Sep 08 '13 at 13:30
  • @GoatCO It's all in theory; I still cannot reproduce it - SQL is still faster :) – Roman Pekar Sep 08 '13 at 13:36
  • @GoatCO I've updated my answer (I've found that the RBAR approach is usually slower than SQL; I tried to do it with temporary/variable tables, but it's still slower), so I think it's better to go with the SQL solution. – Roman Pekar Sep 09 '13 at 06:09
  • @RomanPekar Test the performance against my solution please. Testing on my local SQL instance, I found it to be about 100 times quicker than your CTE solution and 12 times quicker than your RBAR solution when I scaled the # of rows up over 20,000. – UnhandledExcepSean Sep 10 '13 at 19:37
  • Since you are the one doing the perf tests, I post my comment here. The solution by @GordonLinoff is suffering because the `edge` CTE is executed for each invocation of the recursive part of the CTE. It would be better to store the `edge` CTE in a temp table indexed properly and use that in the recursive CTE instead. I don't know how much it would gain. – Mikael Eriksson Sep 10 '13 at 20:07
  • @MikaelEriksson good point, I'll try to find some time to test it tomorrow – Roman Pekar Sep 10 '13 at 20:31
  • @MikaelEriksson I've tried using a temp table or a real table instead of the edges CTE; I got a small performance boost for small numbers of rows, but it's still not comparable with my SQL solution (tested on SQL Fiddle). – Roman Pekar Sep 13 '13 at 12:18
  • I would suggest deleting everything after the first answer; it's the best performer from my testing, and it gets the bounty. The OP might not ever choose an answer. – Hart CO Sep 14 '13 at 05:35
  • Agreed. I've had some time to do some testing too, and the first solution outperforms everything on the page, especially when the row count goes up. I tried to tweak my solution (by quickly getting rid of the CROSS APPLY again, which turns out to be a terrible performance killer, another lesson learned) and then tried many variations, only to end up with what was basically a rip-off of your solution =) Impressive first try, I tip my hat to you sir! =) – deroby Sep 14 '13 at 20:25
  • @RomanPekar The last solution is very interesting to me. May I suggest, instead of the EXCEPT operator, using a unique index with the IGNORE_DUP_KEY option? That way the scanning operation at each iteration can be eliminated. http://sqlfiddle.com/#!6/30346/5 – Aleksandr Fedorenko Sep 18 '13 at 07:57
  • @AlexanderFedorenko hm, I'll check it a bit later; sounds good, I haven't even tested it for performance :) – Roman Pekar Sep 18 '13 at 08:03

Your problem is a graph-walking problem of finding connected subgraphs. It is a little more challenging because your data structure has two types of nodes ("companies" and "publishers") rather than one type.

You can solve this with a single recursive CTE. The logic is as follows.

First, convert the problem into a graph with only one type of node. I do this by making the nodes companies and the edges links between companies, derived from the publisher information. This is just a join:

      select t1.company as node1, t2.company as node2
      from table1 t1 join
           table1 t2
           on t1.publisher = t2.publisher
     )

(For efficiency's sake, you could also add t1.company <> t2.company, but that is not strictly necessary.)

Now, this is a "simple" graph-walking problem, where a recursive CTE is used to create all connections between two nodes. The recursive CTE walks through the graph using joins. Along the way, it keeps a list of all nodes visited; in SQL Server, this needs to be stored in a string.

The code needs to ensure that it doesn't visit a node twice for a given path, because this can result in infinite recursion (and an error). If the above is called edges, the CTE that generates all pairs of connected nodes looks like:

     cte as (
      select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
             1 as level
      from edges e
      union all
      select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
      from cte c join
           edges e
           on c.node2 = e.node1 and
              c.nodes not like '%|'+e.node2+'|%'
     )

Now, with this list of connected nodes, assign each node the minimum of all the nodes it is connected to, including itself. This serves as an identifier of connected subgraphs. That is, all companies connected to each other via the publishers will have the same minimum.

The final two steps are to enumerate this minimum (as the GroupId) and to join the GroupId back to the original data.

The full (and I might add tested) query looks like:

with edges as (
      select t1.company as node1, t2.company as node2
      from table1 t1 join
           table1 t2
           on t1.publisher = t2.publisher
     ),
     cte as (
      select e.node1, e.node2,
             cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
             1 as level
      from edges e
      union all
      select c.node1, e.node2,
             c.nodes+e.node2+'|',
             1+c.level
      from cte c join
           edges e
           on c.node2 = e.node1 and
              c.nodes not like '%|'+e.node2+'|%'
     ),
     nodes as (
       select node1,
              (case when min(node2) < node1 then min(node2) else node1 end
              ) as grp
       from cte
       group by node1
      )
select t.company, t.publisher, grp.GroupId
from table1 t join
     (select n.node1, dense_rank() over (order by grp) as GroupId
      from nodes n
     ) grp
     on t.company = grp.node1;

Note that this works for finding any connected subgraphs. It does not assume any particular number of levels.

EDIT:

The question of performance for this is vexing. At a minimum, the above query will run better with an index on Publisher. Better yet is to take @MikaelEriksson's suggestion and put the edges in a separate table, as sketched below.

Another question is whether you look for equivalence classes among the Companies or the Publishers. I took the approach of using Companies, because I think that has better "explainability" (my inclination to respond was based on numerous comments that this could not be done with CTEs).

I am guessing that you could get reasonable performance from this, although that requires more knowledge of your data and system than is provided in the OP. It is quite likely, though, that the best performance will come from a multiple-query approach.
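
A sketch of that edges-in-a-table variant (the temp table and index names are illustrative, not something I have benchmarked):

select t1.company as node1, t2.company as node2
into #edges
from table1 t1 join
     table1 t2
     on t1.publisher = t2.publisher;

create clustered index ix_edges on #edges (node1, node2);

-- then reference #edges in place of the edges CTE in the query above,
-- so the edge list is computed once rather than on every recursive step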

Gordon Linoff
  • @GordonLinoff check my updated answer with some performance tests; I think there's no way this one could process more than 10000 rows if the groups are long. I have no time to test more today though. – Roman Pekar Sep 07 '13 at 21:43
  • Thanks for the thorough answer, would split bounty if possible, but went with Roman's answer due to his efforts in performance testing. – Hart CO Sep 14 '13 at 05:38

Here is my solution SQL Fiddle

The nature of the relationships requires looping, as I figure.


Here is the SQL:

--drop TABLE Table1

CREATE TABLE Table1
    ([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
;

INSERT INTO Table1
    (Company, Publisher)
select
    left(newid(), 2), left(newid(), 2)

declare @i int = 1

while @i < 8
begin
    ;with cte(Company, Publisher) as (
        select
            left(newid(), 2), left(newid(), 2)
        from Table1
    )
    insert into Table1(Company, Publisher)
    select distinct c.Company, c.Publisher
    from cte as c
    where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)

    set @i = @i + 1
end;


CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)

declare @counter int=0
declare @row int=0
declare @lastnullcount int=0
declare @currentnullcount int=0

WHILE EXISTS (
  SELECT *
  FROM Table1
  where GroupID is null
  )
BEGIN
    SET @counter=@counter+1
    SET @lastnullcount =0

    SELECT TOP 1
        @row=[row]
    FROM Table1
    where GroupID is null
    order by [row] asc

    SELECT @currentnullcount=count(*) from table1 where groupid is null
    WHILE @lastnullcount <> @currentnullcount
    BEGIN
        SELECT @lastnullcount=count(*)
        from table1
        where groupid is null 

        UPDATE Table1
        SET GroupID=@counter
        WHERE [row]=@row

        UPDATE t2
        SET t2.GroupID=@counter
        FROM Table1 t1
        INNER JOIN Table1 t2 on t1.Company=t2.Company
        WHERE t1.GroupID=@counter
        AND t2.GroupID IS NULL

        UPDATE t2
        SET t2.GroupID=@counter
        FROM Table1 t1
        INNER JOIN Table1 t2 on t1.publisher=t2.publisher
        WHERE t1.GroupID=@counter
        AND t2.GroupID IS NULL

        SELECT @currentnullcount=count(*)
        from table1
        where groupid is null
    END
END

SELECT * FROM Table1

Edit: Added indexes, as I would expect on the real table, to be more in line with the other data sets Roman is using.

UnhandledExcepSean
  • @GoatCO Unless the question is closed, new answers may appear. Up- and down-votes will change state too. And then your comment may lose its truth value. When referring to other answers, you can use links to the 'share' URLs and refrain from using terms like 'the other', 'above', and 'below'. In SQL terms, it's like recording 'birthday' instead of 'age'; you're referring to something immutable instead of a variable. – Chris Wesseling Sep 10 '13 at 17:37
  • @GoatCO FYI, after performance testing, my solution performed markedly better on my local SQL Server instance than any of the CTE examples I've seen in this thread. I'm asking Roman to add it to his performance tests. – UnhandledExcepSean Sep 10 '13 at 18:59
  • I've run all answers through a battery of examples, from 1000 records to 1 million records, and found otherwise, but I will re-visit it. – Hart CO Sep 10 '13 at 19:05
  • I've not much time now, here's fast checking of two solutions - http://sqlfiddle.com/#!3/5f3248/1 – Roman Pekar Sep 10 '13 at 19:33
  • @RomanPekar What's really odd about that is that I get totally different results running it in my local SQL server instance. Scale up the rows and you can really start seeing the difference. Not sure why SQL fiddle is so different. – UnhandledExcepSean Sep 10 '13 at 19:43
  • @SpectralGhost I think it's because of the random data; sometimes you get long groups, sometimes short ones – Roman Pekar Sep 10 '13 at 19:51
  • @RomanPekar, no. I was generating a static data set and running the queries against that dataset to ensure consistency. I was setting the groupid to null before running my queries. – UnhandledExcepSean Sep 10 '13 at 19:56
  • @SpectralGhost I changed my solution a bit. I've found that your solution works a little better than mine if there's 1 group. Can't check more today, have to get some sleep :) – Roman Pekar Sep 10 '13 at 20:01
  • I've run my test scenario a number of times, and Roman's answer comes out ahead by a wide margin each time, so gave it to him. Your answer performs well when there's just 1 groupID, in my testing I kept the record count to groupID ratio between 10:1 and 5:1, which I think killed your answer on larger sets. Thanks for contributing! – Hart CO Sep 14 '13 at 05:42

You are trying to find all of the connected components of your graph, which can only be done iteratively. If you know the maximum width of any connected component (i.e. the maximum number of links you would ever have to follow from one company/publisher to another), you could in principle do it with something like this:

SELECT
    MIN(x2.groupID) AS groupID,
    x1.Company,
    x1.Publisher
FROM Table1 AS x1
    INNER JOIN (
        SELECT
            MIN(x2.Company) AS groupID,
            x1.Company,
            x1.Publisher
        FROM Table1 AS x1
            INNER JOIN Table1 AS x2
            ON x1.Publisher = x2.Publisher
        GROUP BY
            x1.Publisher,
            x1.Company
    ) AS x2
    ON x1.Company = x2.Company
GROUP BY
    x1.Publisher,
    x1.Company;

You have to keep nesting the subquery (alternating joins on Company and Publisher, and with the deepest subquery saying MIN(Company) rather than MIN(groupID)) to the maximum iteration depth.
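
For illustration, one more level of nesting would look like this (a sketch following the same pattern; the deepest level seeds groupID from MIN(Company), and each level above alternates the join column):

SELECT
    MIN(x2.groupID) AS groupID,
    x1.Company,
    x1.Publisher
FROM Table1 AS x1
    INNER JOIN (
        SELECT
            MIN(x2.groupID) AS groupID,
            x1.Company,
            x1.Publisher
        FROM Table1 AS x1
            INNER JOIN (
                -- deepest level: seed group IDs from company names
                SELECT
                    MIN(x2.Company) AS groupID,
                    x1.Company,
                    x1.Publisher
                FROM Table1 AS x1
                    INNER JOIN Table1 AS x2
                    ON x1.Publisher = x2.Publisher
                GROUP BY
                    x1.Publisher,
                    x1.Company
            ) AS x2
            ON x1.Company = x2.Company
        GROUP BY
            x1.Publisher,
            x1.Company
    ) AS x2
    ON x1.Publisher = x2.Publisher
GROUP BY
    x1.Publisher,
    x1.Company;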

I don't really recommend this, though; it would be cleaner to do this outside of SQL.

Disclaimer: I don't know anything about SQL Server 2012 (or any other version); it may have some kind of additional scripting ability to let you do this iteration dynamically.

Alex Godofsky
  • Typically in child/parent questions a recursive CTE is used to connect successive pairs, but I've never seen an example that groups results in both directions into a 'family'. – Hart CO Sep 06 '13 at 17:18
  • As you can see in my answer http://stackoverflow.com/a/18668786/1980282, a recursive solution is possible. – FrankPl Sep 09 '13 at 07:47

This is a recursive solution, using XML:

with a as ( -- recursive result, containing shorter subsets and duplicates
    select cast('<c>' + company + '</c>' as xml) as companies
          ,cast('<p>' + publisher + '</p>' as xml) as publishers
      from Table1

    union all

    select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
                                                        sql:column("t.company")))
                          order by $c
                          return <c>{$c}</c>')
          ,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
                                                         sql:column("t.publisher")))
                          order by $p
                          return <p>{$p}</p>')
    from a join Table1 t
      on (   a.companies.exist('/c[text() = sql:column("t.company")]') = 0 
          or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
     and (   a.companies.exist('/c[text() = sql:column("t.company")]') = 1
          or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
    select distinct -- distinct cannot work on xml types, hence cast to nvarchar
           cast(companies as nvarchar) as companies
          ,cast(publishers as nvarchar) as publishers
          ,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
     from a
    where not exists (select 1 from a as s -- s is a proper subset of a
                       where (cast('<s>' + cast(s.companies as varchar)
                                 + '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
                             ).value('if((count(/s/c) > count(/a/c))
                                         and (some $s in /s/c/text() satisfies
                                             (some $a in /a/c/text() satisfies $s = $a))
                                      ) then 1 else 0', 'int') = 1
                     )
      and not exists (select 1 from a as s -- s is a proper subset of a
                       where (cast('<s>' + cast(s.publishers as nvarchar)
                                 + '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
                             ).value('if((count(/s/p) > count(/a/p))
                                         and (some $s in /s/p/text() satisfies
                                             (some $a in /a/p/text() satisfies $s = $a))
                                      ) then 1 else 0', 'int') = 1
                     )
), c as (  -- cast back to xml
    select cast(companies as xml) as companies
          ,cast(publishers as xml) as publishers
          ,groupid
      from b
)
select Co.company.value('(./text())[1]', 'varchar') as company
      ,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
      ,c.groupid
  from c
       cross apply companies.nodes('/c') as Co(company)
       cross apply publishers.nodes('/p') as Pu(publisher)
 where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
               where t.company = Co.company.value('(./text())[1]', 'varchar')
                 and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
             )

The set of companies and the set of publishers are kept in XML fields in the intermediate steps, and some casting between xml and nvarchar is necessary due to some limitations of SQL Server (like not being able to group by or use distinct on XML columns).
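
A minimal illustration of that limitation and the cast workaround (a toy example, separate from the query above):

declare @t table (x xml)
insert into @t values ('<c>A</c>'), ('<c>A</c>')

-- select distinct x from @t              -- fails: the xml type cannot be compared
select distinct cast(x as nvarchar(max)) as x_text
from @t                                   -- works once cast to nvarchar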

FrankPl

A bit late to the challenge, and since SQLFiddle seems to be down at the moment I'll have to guess your data structures. Nevertheless, it seemed like a fun challenge (and it was =) so here's what I made of it:

Setup:

IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
GO

CREATE TABLE t_company ( company_id     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
                         company_name   varchar(100) NOT NULL)

GO 

CREATE TABLE t_publisher (publisher_id     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
                          publisher_name   varchar(100) NOT NULL)

CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
                     publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
                                PRIMARY KEY (company_id, publisher_id),
                     group_id int NULL
                             )
GO

-- example content


-- ROW   GROUPID     Company     Publisher
--1     1           A           Y
--2     1           A           X
--3     1           B           Y
--4     1           B           Z
--5     2           C           W
--6     2           C           P
--7     2           D           W


INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')

INSERT t_link (company_id, publisher_id)
SELECT company_id, publisher_id
  FROM t_company, t_publisher
 WHERE (company_name = 'A' AND publisher_name = 'Y')
    OR (company_name = 'A' AND publisher_name = 'X')
    OR (company_name = 'B' AND publisher_name = 'Y')
    OR (company_name = 'B' AND publisher_name = 'Z')
    OR (company_name = 'C' AND publisher_name = 'W')
    OR (company_name = 'C' AND publisher_name = 'P')
    OR (company_name = 'D' AND publisher_name = 'W')




GO

/*
-- volume testing

TRUNCATE TABLE t_link
DELETE t_company
DELETE t_publisher


DECLARE @company_count   int = 1000,
        @publisher_count int = 450,
        @links_count     int = 800


INSERT t_company (company_name)
SELECT company_name    = Convert(varchar(100), NewID())
  FROM master.dbo.fn_int_list(1, @company_count) 

UPDATE STATISTICS t_company

INSERT t_publisher (publisher_name)
SELECT publisher_name  = Convert(varchar(100), NewID())
  FROM master.dbo.fn_int_list(1, @publisher_count) 

UPDATE STATISTICS t_publisher

-- Random links between the companies & publishers

DECLARE @count int
SELECT @count = 0

WHILE @count < @links_count
    BEGIN

        SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
          INTO #link_A
          FROM t_company
         ORDER BY NewID()

        SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
          INTO #link_B
          FROM t_publisher
         ORDER BY NewID()

        INSERT TOP (@links_count - @count) t_link (company_id, publisher_id)
        SELECT A.company_id,
               B.publisher_id
          FROM #link_A A
          JOIN #link_B B
            ON A.row_id = B.row_id
         WHERE NOT EXISTS ( SELECT *
                              FROM t_link old
                             WHERE old.company_id   = A.company_id
                               AND old.publisher_id = B.publisher_id)

        SELECT @count = @count + @@ROWCOUNT

        DROP TABLE #link_A
        DROP TABLE #link_B    
    END

*/

Actual grouping:

IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
GO

-- apply grouping

-- init
SELECT row_id = IDENTITY(int, 1, 1), 
       company_id,
       publisher_id,
       group_id = 0
  INTO #links
  FROM t_link

-- I don't see an index that would actually be helpful here right away; using row_id to avoid a HEAP
CREATE CLUSTERED INDEX idx0 ON #links (row_id)
--CREATE INDEX idx1 ON #links (company_id)   
--CREATE INDEX idx2 ON #links (publisher_id)

UPDATE #links
   SET group_id = row_id


-- start grouping
WHILE @@ROWCOUNT > 0
    BEGIN  
        UPDATE #links
           SET group_id = new_group_id
          FROM #links upd
          CROSS APPLY (SELECT new_group_id = Min(group_id)
                         FROM #links new
                        WHERE new.company_id   = upd.company_id
                           OR new.publisher_id = upd.publisher_id 
                                     ) x
        WHERE upd.group_id > new_group_id

        -- select * from #links
    END


-- remove 'holes'
UPDATE #links
   SET group_id = (SELECT COUNT(DISTINCT o.group_id) 
                          FROM #links o
                         WHERE o.group_id <= upd.group_id)
  FROM #links upd

GO

UPDATE t_link
   SET group_id = new.group_id
  FROM t_link upd
  LEFT OUTER JOIN #links new
               ON new.company_id = upd.company_id
              AND new.publisher_id = upd.publisher_id

GO    
SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
       l.group_id,
       c.company_name, -- c.company_id,
       p.publisher_name -- , p.publisher_id
 from t_link l
 JOIN t_company c
   ON l.company_id = c.company_id
 JOIN t_publisher p 
   ON p.publisher_id = l.publisher_id
 ORDER BY 1

At first sight this approach hasn't been tried by anyone else yet; it's interesting to see how this can be done in a variety of ways... (I preferred not to read the other answers upfront, as it would spoil the puzzle =)

Results look as expected (as far as I understand the requirements and the example), and performance isn't too shabby either, although there is no real indication of the number of records this should work on; I'm not sure how it would scale, but I don't expect too many problems either...

deroby