0

i have a table like this:

id | arr_val  | grp
-----------------
1  | {10,20}  | -
2  | {20,30}  | -
3  | {50,5}   | -
4  | {30,60}  | -
5  | {1,5}    | -
6  | {7,6}    | -

I want to find out which rows are in a group together. In this example 1,2,4 are one group because 1 and 2 have a common element and 2 and 4. 3 and 5 form a group because they have a common element. 6 Has no common elments with anybody else. So it forms a group for itself. The result should look like this:

id | arr_val  | grp
-----------------
1  | {10,20}  | 1
2  | {20,30}  | 1
3  | {50,5}   | 2
4  | {30,60}  | 1
5  | {1,5}    | 2
6  | {7,6}    | 3

I think i need recursive cte because my problem is graphlike but i am not sure how to that.

Additional info and background:

The Table has ~2500000 rows.

In reality the problem i try to solve has more fields and conditions for finding a group:

id | arr_val  | date | val | grp
---------------------------------
1  | {10,20}  | -
2  | {20,30}  | -

Not only do the element of a group need to be linked by common elements in arr_val. They all need to have the same value in val and need to be linked by a timespan in date (gaps and islands). I solved the other two but now the condition of my question was added. If there is an easy way to do all three together in one query that would be awesome but it is not necessary.

----Edit-----

While both answer work for the example of five rows they do not work for a table with a lot more rows. Both answers have the problem that the number of rows in the recursive part explodes and only reduce them at the end. A solutiuon should work for data like this too:

id | arr_val  | grp
-----------------
1  | {1}      | -
2  | {1}      | -
3  | {1}      | -
4  | {1}      | -
5  | {1}      | -
6  | {1}      | -
7  | {1}      | -
8  | {1}      | -
9  | {1}      | -
10 | {1}      | -
11 | {1}      | -
more rows........

Is there a solution to that problem?

GMB
  • 216,147
  • 25
  • 84
  • 135
HrkBrkkl
  • 613
  • 5
  • 22

3 Answers3

1

Here is an approach at this graph-walking problem:

with recursive cte as (
    select id, arr_val, array[id] path from mytable
    union all
    select t.id, t.arr_val, c.path || t.id
    from cte c
    inner join mytable t on t.arr_val && c.arr_val and not t.id = any(c.path)
)
select c.id, c.arr_val, dense_rank() over(order by min(x.id)) grp
from cte c
cross join lateral unnest(c.path) as x(id)
group by c.id, c.arr_val
order by c.id

The common-table-expression walks the graph, recursively looking for "adjacent" nodes to the current node, while keeping track of already-visited nodes. Then the outer query aggregates, identifies groups using the least node per path, and finally ranks the groups.

Demo on DB Fiddle:

id | arr_val | grp
-: | :------ | --:
 1 | {10,20} |   1
 2 | {20,30} |   1
 3 | {50,5}  |   2
 4 | {30,60} |   1
 5 | {1,5}   |   2
 6 | {7,6}   |   3
GMB
  • 216,147
  • 25
  • 84
  • 135
1

You can handle this as a recursive CTE. Define the edges between the ids based on common values. Then traverse the edges and aggregate:

with recursive nodes as (
      select id, val
      from t cross join
           unnest(arr_val) as val
     ),
     edges as (
      select distinct n1.id as id1, n2.id as id2
      from nodes n1 join
           nodes n2
           on n1.val = n2.val
     ),
     cte as (
      select id1, id2, array[id1] as visited, 1 as lev
      from edges
      where id1 = id2
      union all
      select cte.id1, e.id2, visited || e.id2,
             lev + 1
      from cte join
           edges e
           on cte.id2 = e.id1
      where e.id2 <> all(cte.visited) 
     ),
     vals as (
      select id1, array_agg(distinct id2 order by id2) as id2s
      from cte
      group by id1
    )
select *, dense_rank() over (order by id2s) as grp
from vals;

Here is a db<>fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • why did you set lev < 8 ? – HrkBrkkl Oct 26 '20 at 12:42
  • @goodsnek . . . I removed it. That is something I use for debugging and sometimes forget to remove. – Gordon Linoff Oct 26 '20 at 18:00
  • @goodsnek . . . Is there a reason you unaccepted the answer? – Gordon Linoff Oct 28 '20 at 10:32
  • Yes, while the solution works for the example it doesnt scale. If there is just one group in my data with 20 members that are fully conected (For example 20 entries with arr_val of {1}) Your solution will walk all possible paths which is faculty of 20 and only aggregate them at the end. A working solution needs a way to reduce the amount of walked paths or parallel searches. – HrkBrkkl Oct 28 '20 at 11:36
  • by the way i was able to optimize your method a little bit. You dont have to unnest for the nodes: nodes(id, val) as ( select id, arr_val from test_cats ), edges as ( select n1.id as id1, n2.id as id2 from nodes n1 join nodes n2 on n1.val && n2.val ), – HrkBrkkl Oct 28 '20 at 11:41
  • @goodsnek . . . At least that is a reason. I added `distinct` to the `edges` CTE. That might improve performance. – Gordon Linoff Oct 28 '20 at 12:48
  • Foudn a solution that works for my case and added it as an answer but i think i will mark your question as accepted as it is the fastest for small amount of data – HrkBrkkl Oct 28 '20 at 16:20
0

While Gordon Linoffs solution is the fastest i found for a small amount of data where the groups are not so big it will not work for bigger datasets and bigger groups. I changed his solution to make it work. I moved the edges to an indexed table: create table edges ( id1 integer not null, id2 integer not null, constraint staffel_group_nodes_pk primary key (id1, id2) );

insert into edges(id1, id2) with
nodes(id, arr_val) as (
    select id, arr_val
    from my_table
)
    select n1.id as id1, n2.id as id2
    from nodes n1
             join
         nodes n2
         on n1.arr_val && n2.arr_val ;

Tha alone didnt help. I changed his recursive part too:

with recursive
    cte as (
        select id1, array [id1] as visited
        from edges
        where id1 = id2
        union all
        select unnested.id1, array_agg(distinct unnested.vis) as visited
        from (
                 select cte.id1,
                        unnest(cte.visited || e.id2) as vis
                 from cte
                          join
                      staffel_group_edges e
                      on e.id1 = any (cte.visited)
                          and e.id2 <> all (cte.visited)) as unnested
        group by unnested.id1
    ),
    vals as (
        select id1, array_agg(distinct vis) as id2s
        from (
                 select cte.id1,
                        unnest(cte.visited) as vis
                 from cte) as unnested

        group by unnested.id1
    )
select id1,id2s, dense_rank() over (order by id2s) as grp
from vals;

Every step i group all searches by their starting point. This reduces the amount of parallel walked paths a lot and works surprisingly fast.

HrkBrkkl
  • 613
  • 5
  • 22