Concept
Your problem basically consists of 3 steps:
- identify the connected components of an undirected graph
G
- represent each connected component by its vertex with the minimal id
- map each edge of the graph to the representative of the connected component it belongs to.
The task is solved by:
- considering an 'equivalent' directed graph
D_1
based on the natural numerical ordering of G
's vertex ids.
- reducing
D_1
to a bipartite graph B
containing all candidate ids by abstracting away chains in the ordering represented by D_1
.
- determining minimal representative ids in
B
.
- establishing a mapping from
G
's edges onto B
's edges and thus its weakly connected components (WCCs).
Details
The edge set of G
is precisely the set of pairs (id1, id2)
given by the records of unique_pairs
. The vertex set of G
matches the set of values from both id
columns of unique_pairs
. As G
is undirected, it does not matter whether to represent an edge by the record (id1, id2)
or (id2, id1)
; henceforth we assume wlog that id1 < id2
holds for each record (id1, id2)
in unique_pairs
.
We tackle the task by constructing an auxiliary structure based on the digraph D_1
derived from G
by orienting each edge of the undirected graph according to the numerical order of the ids of its incident vertices. Let D_2
be D_1
with all edge orientations reversed. Identifying the connected components of the original graph amounts to identifying the WCCs in either D_1
or D_2
. Observe that D_1
is acyclic by construction.
The assisting auxiliary structure is a mapping of the set of edges in G
onto the powerset of maximal chains ( = a path not contained in another path ) in D_1
, such that all chains in the image of an edge e
contain e
and maximize the difference between the numerical id of its terminals ( = start and end vertices ). The use of powersets is a technicality as otherwise the mapping might not be well-defined when two maximal chains share the same terminals but follow 'alternative routes' (usually the image will contain a single element).
In the next step we coarsen the mapping: instead of sets of maximal chains the image will be the pair of terminal ids of any maximal chain from the original mapping's image.
Given this mapping, we build an auxiliary bipartite graph B
whose edges precisely represent the ordered pairs of terminal ids. The graph is bipartite since no end vertex of a maximal chain can be a start vertex of another maximal chain and vice versa - if this happened, said chains wouldn't have been maximal.
However, by construction, B
forms a part of D_1
's transitive closure. Therefore, the vertex sets of B
's WCCs are subsets of the vertex sets of D_1
's WCCs. We are ultimately interested in the vertex with the minimal id in each of the latter WCCs. As no terminal id contained in a maximal chain can be smaller than the start vertex id, all of these vertices must be contained in B
's vertex set. So we have to consider B
's WCCs only taking the minimum terminal id of each of them. Together with the auxiliary mapping of all G
edges onto B
edges and thus onto B
's WCCs, we solve the original problem.
SQL implementation
In a preprocessing step, G
's edges are normalized to the natural ordering of its incident vertices. Assuming that the orientation of the edges is id1 -> id2
, test_unique_pairs
also represents D_1
.
update test_unique_pairs
set id1 = id2
, id2 = id1
where id1 > id2
;
The first query maps the edges of D_1
to the edges of B
, ie. the pairs of terminal ids of maximal chains in D_1
. These chains are acyclic by construction and will be computed by a hierarchical query on the test_unique_pairs
and they will be represented by their start and end vertices, these pairs also representing edges in B
. This mapping occurs modulo alternative chains sharing both terminals - different maximal paths are considered equivalent if they share both terminals. Hierarchical queries in oracle allow for easy extraction of the maximal elements. The minimal elements, however, are the maximal elements of the hierarchy with parent-child relationships reversed (to cater for this computation is the reason for considering both digraphs D_1
, D_2
).
In order to ease the second processing stage, the query id wrapped into a view definition:
create or replace view test_up_chains as
select *
from ( -- collect minimal/maximal reachable id for each edge in G
select distinct
inter.id1
, inter.id2
, min(minelt) chainstart
, max(maxelt) chainend
from ( -- collect minimum/maximum terminal id for chains starting/ending with any of G's edges
select distinct
inter_min.id1
, inter_min.id2
, inter_min.minelt
, inter_max.maxelt
from ( -- hierarchical query for maximal chains in D_1
select distinct
hup.ID1
, hup.ID2
, CONNECT_BY_ROOT hup.id2 maxelt
from test_unique_pairs hup
connect by nocycle prior hup.id1 = hup.id2
start with hup.id2 in (
select ref1.id2
from test_unique_pairs ref1
where not exists (
select 1
from test_unique_pairs ref2
where ref2.id1 = ref1.id2
)
)
) inter_max
join ( -- hierarchical query for maximal chains in D_2 ( = D_1 with reversed edge orientations )
select distinct
hup.ID1
, hup.ID2
, CONNECT_BY_ROOT hup.id1 minelt
from test_unique_pairs hup
connect by nocycle prior hup.id2 = hup.id1
start with hup.id1 in (
select ref1.id1
from test_unique_pairs ref1
where not exists (
select 1
from test_unique_pairs ref2
where ref2.id2 = ref1.id1
)
)
) inter_min
on ( inter_min.id1 = inter_max.id1 AND inter_min.id2 = inter_max.id2 )
) inter
group by grouping sets ( (inter.id1, inter.id2) )
) base
order by base.id1
, base.id2
;
Next the edges of B
are coalesced into WCCs and their respective minimal ids extracted. The following hierarchical query, lacking a STARTS WITH
clause, builds the complete transitive closure wrt the edge incidence relation, and for each edge in B
produces root paths to all other edges in the same WCC. Thus minimizing over the start vertex of all hierarchy 'root's given an edge in G
finds the minimal reachable id in G
. Cycles are suppressed by the NOCYCLE
directive.
select *
from (
select distinct
id1
, id2
, min(chainlim) chainrep
from (
select distinct
id1
, id2
, connect_by_root chainstart chainlim
from test_up_chains
connect by nocycle ( ( prior chainstart = chainstart and prior chainend != chainend ) OR ( prior chainstart != chainstart and prior chainend = chainend ) )
) base
group by grouping sets ( ( base.id1, base.id2 ) )
)
order by id1
, id2
;