6

Goal: Efficiently find all the disconnected graphs (connected components) in a large collection of sets.

For example, I have a data file like the following:

A, B, C
C, D, E
A, F, Z
G, J
...

Each entry represents a set of elements. The first entry, A, B, C, is the set {A, B, C}. It also indicates that there is an edge between A and B, A and C, and B and C.

The algorithm I initially came up with was the following

1. Parse all the entries into a list:
[
{A,B,C}
{C,D,E}
...
]
2. Start with the first element (set) of the list, called start_entry; {A,B,C} in this case.
3. Traverse the other elements in the list and do the following:
     if the intersection of the element and start_entry is not empty
          start_entry = start_entry union the element
          remove the element from the list
4. With the updated start_entry, traverse the list again until there are no new updates.
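For reference, the four steps above could be sketched in Python roughly as follows. The function name `naive_components` and the outer loop that restarts on the leftover sets (so that every component is reported, not just the first) are assumptions added for illustration; the steps as written only describe growing one `start_entry`.

```python
def naive_components(entries):
    """Group entries into connected vertex sets by repeated merging."""
    remaining = [set(e) for e in entries]    # step 1: list of sets
    components = []
    while remaining:                         # assumed: restart for each new component
        start_entry = remaining.pop(0)       # step 2: take the first set
        updated = True
        while updated:                       # step 4: rescan until nothing changes
            updated = False
            leftover = []
            for element in remaining:        # step 3: scan the other sets
                if start_entry & element:    # non-empty intersection
                    start_entry |= element   # union into start_entry
                    updated = True           # element is dropped from the list
                else:
                    leftover.append(element)
            remaining = leftover
        components.append(start_entry)
    return components

result = naive_components([["A", "B", "C"], ["C", "D", "E"], ["A", "F", "Z"], ["G", "J"]])
```

Each rescan touches every remaining set, so in the worst case the work grows quadratically with the number of entries, which matches the slowdown described for a large dataset.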

The algorithm above should return the list of vertices of a connected graph. However, I ran into runtime problems because of the dataset size: there are ~100,000 entries. Does anyone know a more efficient way to find the connected graphs?

The data could also be restructured (if that is easier) as A,B B,C E,F ... with each entry representing an edge of the graph.
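A minimal sketch of that conversion, assuming each entry is an ordered sequence of vertex labels (the helper name `sets_to_edges` is hypothetical). It links consecutive elements only, which gives k-1 edges for a k-element entry instead of all k(k-1)/2 pairs, yet keeps the same connectivity.

```python
def sets_to_edges(entries):
    """Turn each entry into a chain of edges between consecutive elements."""
    edges = []
    for entry in entries:
        items = list(entry)
        # zip pairs each element with its successor: (a, b), (b, c), ...
        edges.extend(zip(items, items[1:]))
    return edges

edges = sets_to_edges([["A", "B", "C"], ["C", "D", "E"], ["G", "J"]])
```

Note that a single-element entry produces no edge, so an isolated vertex would have to be tracked separately in this representation.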

inspectorG4dget
Junwei su

3 Answers

6

This looks like an ideal case for using a disjoint set data structure.

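This lets you join sets together in almost linear time: with path compression and union by rank, each operation costs amortized inverse-Ackermann time, which is effectively constant.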

Example Python code

from collections import defaultdict

data=["A","B","C"],["C","D","E"],["F","G"]

# Prepare mapping from data element to index
S = {}
for a in data:
    for x in a:
        if x not in S:
            S[x] = len(S)

N = len(S)
rank=[0]*N
parent = list(range(N))  # a real list, so entries can be reassigned on Python 3

def Find(x):
    """Find representative of connected component"""
    if  parent[x] != x:
        parent[x] = Find(parent[x])
    return parent[x]

def Union(x,y):
    """Merge sets containing elements x and y"""
    x = Find(x)
    y = Find(y)
    if x == y:
        return
    if rank[x]<rank[y]:
        parent[x] = y
    elif rank[x]>rank[y]:
        parent[y] = x
    else:
        parent[y] = x
        rank[x] += 1

# Merge all sets
for a in data:
    x = a[0]
    for y in a[1:]:
        Union(S[x],S[y])

# Report disconnected graphs
V=defaultdict(list)
for x in S:
    V[Find(S[x])].append(x)

print(list(V.values()))

prints (on Python 3.7+, where dicts preserve insertion order)

[['A', 'B', 'C', 'D', 'E'], ['F', 'G']]
Peter de Rivaz
  • It may be simpler to just do a DFS or BFS. – user2357112 Jun 16 '17 at 20:01
  • true, although I often find I get stack overflow in Python when trying DFS or BFS and then have to convert the code into a non-recursive formulation – Peter de Rivaz Jun 16 '17 at 20:05
  • That said, kudos for actually using a proper disjoint-set forest with path compression and union by rank, rather than one of those non-working `set` and `union`-based things that pop up too frequently. – user2357112 Jun 16 '17 at 20:06
  • I think you might want `Find(S[x])` instead of `parent[S[x]]` when building the connected components at the end. – user2357112 Jun 16 '17 at 20:08
3

Use networkx, a module specifically designed to handle graphs efficiently:

import networkx as nx
sets = [{'A','B','C'}, {'C','D','E'}, {'F','G','H'}, ...]

Create a graph and add SOME edges to it (linking consecutive elements of each set is enough to connect the whole set):

G = nx.Graph()
for s in sets:
    l = list(s)
    G.add_edges_from(zip(l, l[1:]))

Extract connected components ("disconnected graphs" in your terminology):

print(list(nx.connected_components(G)))
# [{'D', 'C', 'E', 'B', 'A'}, {'F', 'H', 'G'}]
DYZ
  • `itertools.combinations` is fine for 3-element sets, but it adds more edges than necessary, and a *lot* more edges than necessary if the sets get bigger. – user2357112 Jun 16 '17 at 20:35
-2

Have a look at the Rosetta Code task Set consolidation.

Given two sets of items then if any item is common to any set then the result of applying consolidation to those sets is a set of sets whose contents is:

  • The two input sets if no common item exists between the two input sets of items.
  • The single set that is the union of the two input sets if they share a common item.

Given N sets of items where N>2 then the result is the same as repeatedly replacing all combinations of two sets by their consolidation until no further consolidation between set pairs is possible. If N<2 then consolidation has no strict meaning and the input can be returned.
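As a rough illustration of that definition, here is a minimal Python sketch. It is not the Rosetta Code solution; the function name `consolidate` and the incremental approach (folding each new set into the already-consolidated results) are assumptions made for this example.

```python
def consolidate(sets):
    """Merge sets that share any item until no two results overlap."""
    result = []                    # invariant: sets in result are pairwise disjoint
    for s in sets:
        merged = set(s)
        disjoint = []
        for r in result:
            if r & merged:         # shares an item: absorb it into merged
                merged |= r
            else:                  # no common item: keep it unchanged
                disjoint.append(r)
        disjoint.append(merged)
        result = disjoint
    return result

groups = consolidate([{"A", "B"}, {"C", "D"}, {"D", "B"}])
```

Each new set is compared against every existing group, so this can still be quadratic in the number of sets when few of them overlap.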

Paddy3118