Group list of lists by shared elements

Question

Assume I have the following list of sublists:

l = [['a', 'b'], 
 ['a', 'c'], 
 ['b', 'c'],
 ['c', 'd'],  
 ['e', 'f'], 
 ['f', 'g'], 
 ['x', 'y']]

My goal is to rearrange that list into "buckets" such that each sublist in a bucket shares an element with at least one other sublist in the bucket and shares no element with any sublist in a different bucket. It is a little hard to express this in words, but in this case, the desired result would be:

result = [
    [
        ['a', 'b'],
        ['a', 'c'],
        ['b', 'c'],
        ['c', 'd']
    ],
    [
        ['e', 'f'],
        ['f', 'g']
    ],
    [
        ['x', 'y']   
    ],
]

The idea here is that ['a','b'] goes into Bucket 1. ['a','b'] shares elements with ['a', 'c'] and ['b', 'c'], so those go into Bucket 1 as well. Now ['c', 'd'] also shares an element, c, with the sublists currently in Bucket 1, so it gets added to Bucket 1 as well. After that, no remaining sublist shares an element with those in Bucket 1, so we open a new Bucket 2, starting with ['e', 'f']. ['e', 'f'] shares an element with ['f', 'g'], so that goes into Bucket 2 as well. Then we are done with Bucket 2. ['x', 'y'] gets its own Bucket 3.

I know how to do all of this recursively, but l is very large, and I am wondering whether there is a quicker way to group the elements together!

This is essentially the problem of finding strongly connected components in an undirected graph. You can think of each pair of strings as edges in a graph. First identify the components, then create a list for each component and assign the pairs to the appropriate list. — Tom Karzes, Nov 07 '20 at 00:45
Does this answer your question? [How to find Strongly Connected Components in a Graph?](https://stackoverflow.com/questions/33590974/how-to-find-strongly-connected-components-in-a-graph) — Serial Lazer, Nov 07 '20 at 00:46
@TomKarzes Note that "strongly connected component" is a term that makes sense in directed graphs. For an undirected graph, we just say "connected component". — Stef, Nov 07 '20 at 01:04
In my opinion it's a different question: just because a problem reduces to another doesn't mean they are the same. Nobody stores a graph this way, so it's very unlikely that an existing answer for connected components discovery would solve this particular problem. You will have to write code to get the desired output, therefore it's not the same problem. — cglacet, Nov 07 '20 at 10:48

score 1 · Answer 1 · answered Nov 07 '20 at 01:12

This code seems to work:

l = [
 ['a', 'b'], 
 ['a', 'c'], 
 ['b', 'c'],
 ['c', 'd'],  
 ['e', 'f'], 
 ['f', 'g'], 
 ['x', 'y']]
 
l2 = []

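# merge lists into disjoint sets; a sublist may bridge several
# existing sets, so absorb every set it overlaps
for x in l:
    merged = set(x)
    disjoint = []
    for x2 in l2:
        if x2 & merged:
            merged |= x2
        else:
            disjoint.append(x2)
    l2 = disjoint + [merged]

# output lists
d = {i: [] for i in range(len(l2))}

# match each list to the (unique) set that contains its elements
for x in l:
    for k in d:
        if set(x) & l2[k]:
            d[k].append(x)
            break

# collect dictionary values
fl = list(d.values())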

print(fl)

Output

[[['a', 'b'], 
  ['a', 'c'], 
  ['b', 'c'], 
  ['c', 'd']], 
 [['e', 'f'], 
  ['f', 'g']], 
 [['x', 'y']]]
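
For very large inputs, a union-find (disjoint-set) structure is a common alternative that avoids comparing each sublist against every existing set and does not depend on the order in which the sublists arrive. The following is only a minimal sketch, not part of the original answer: the function name `group_by_shared_elements` and its helpers are made up for illustration, and it assumes every sublist is non-empty and contains hashable elements.

```python
from collections import defaultdict

def group_by_shared_elements(sublists):
    """Group sublists that are linked, directly or indirectly, by shared elements."""
    parent = {}  # element -> parent element in the disjoint-set forest

    def find(x):
        # Register unseen elements as their own root
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # All elements of one sublist belong to the same group
    for sub in sublists:
        find(sub[0])
        for other in sub[1:]:
            union(sub[0], other)

    # Bucket each sublist under the root of its first element;
    # dicts keep insertion order, so buckets follow first appearance
    buckets = defaultdict(list)
    for sub in sublists:
        buckets[find(sub[0])].append(sub)
    return list(buckets.values())

l = [['a', 'b'], ['a', 'c'], ['b', 'c'], ['c', 'd'],
     ['e', 'f'], ['f', 'g'], ['x', 'y']]
print(group_by_shared_elements(l))
```

On the list from the question this should reproduce the three buckets shown above, and a late "bridging" sublist such as ['b', 'c'] in [['a', 'b'], ['c', 'd'], ['b', 'c']] ends up merging everything into a single bucket.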

cglacet · Answer 2 · 2020-11-07T11:00:33.007

Here is an alternative using the suggested reduction to a graph problem. I hope the code is clear enough, but I'll still add a few explanations.

Convert to an adjacency list

Just because it's easier to work with:

from collections import defaultdict

edges = [
    ['a', 'b'], 
    ['a', 'c'], 
    ['b', 'c'],
    ['c', 'd'],  
    ['e', 'f'], 
    ['f', 'g'], 
    ['x', 'y'],
]

def graph_from_edges(edges):
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph

graph = graph_from_edges(edges)

The graph now contains:

{
    'a': {'c', 'b'}, 
    'b': {'c', 'a'}, 
    'c': {'d', 'b', 'a'}, 
    'd': {'c'}, 
    'e': {'f'}, 
    'f': {'e', 'g'}, 
    'g': {'f'}, 
    'x': {'y'}, 
    'y': {'x'}
}

Find the connected component of a given node

This is a simpler sub-problem to solve: we start from a given node and explore the graph around it until no unvisited nodes remain:

def connected_component_from(graph, starting_node):
    # Use a set literal: set(starting_node) would split a
    # multi-character name such as 'ab' into {'a', 'b'}
    nodes = {starting_node}
    visited = set()
    while nodes:
        node = nodes.pop()
        yield node
        visited.add(node)
        nodes |= graph[node] - visited

print(list(connected_component_from(graph, 'a')))

This prints the list of nodes in the connected component of node 'a' (the order may vary, since sets are unordered):

['a', 'b', 'c', 'd']

Finding all connected components

Now we just need to repeat the previous operation until we have visited all nodes in the graph. To discover new unexplored components we simply pick a random unvisited node to start over:

import random

def connected_components(graph):
    all_nodes = set(graph.keys())
    visited = set()
    while all_nodes - visited:
        starting_node = random_node(all_nodes - visited)
        connected_component = set(connected_component_from(graph, starting_node))
        yield connected_component
        visited |= connected_component

def random_node(nodes):
    # random.sample returns a list and no longer accepts a set (Python 3.11+),
    # so pick a single node from a list copy instead
    return random.choice(list(nodes))


graph_cc = list(connected_components(graph))
print(graph_cc)

Which prints:

[{'a', 'c', 'd', 'b'}, {'g', 'e', 'f'}, {'y', 'x'}]

Shortcut

You could also use an existing library to compute these connected components for you, for example networkx:

import networkx as nx

G = nx.Graph()

G.add_edges_from(edges)
cc = list(nx.connected_components(G))
print(cc)

Which also prints:

[{'a', 'c', 'd', 'b'}, {'g', 'e', 'f'}, {'y', 'x'}]

In practice that would be the best solution, but it is less interesting if the goal is to learn new things. Note that you can read the networkx implementation of this function (it uses a BFS).

Going back to the original problem

We managed to find the nodes of each connected component, but that's not what you wanted, so we need to get the original lists back. To do this a bit faster on large graphs, one possibility is to first build a map from node names to the index of their connected component in the previous list:

node_cc_index = {u: i for i, cc in enumerate(graph_cc) for u in cc}
print(node_cc_index)

Which gives:

{'g': 0, 'e': 0, 'f': 0, 'a': 1, 'c': 1, 'd': 1, 'b': 1, 'y': 2, 'x': 2}

We can use that to split the list of edges into groups, as you first requested:

edges_groups = [[] for _ in graph_cc]
for u, v in edges:
    edges_groups[node_cc_index[u]].append([u, v])

print(edges_groups)

Which finally gives:

[
    [['e', 'f'], ['f', 'g']], 
    [['a', 'b'], ['a', 'c'], ['b', 'c'], ['c', 'd']], 
    [['x', 'y']]
]

Each sublist keeps the original order of its edges, but the order between sublists is not preserved in any way (it's a direct result of the random choice we made). If that is a problem, we could replace the random pick with picking the "first" unvisited node.
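
A minimal sketch of that deterministic variant follows. It is not the code from this answer: the function name `group_edges_in_order` is made up for illustration, and it relies on Python dicts iterating in insertion order (Python 3.7+), so components are numbered by the first edge in which one of their nodes appears.

```python
from collections import defaultdict, deque

def group_edges_in_order(edges):
    """Group edges by connected component, keeping first-appearance order."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    component_of = {}   # node -> index of its connected component
    n_components = 0
    for start in graph:              # keys come back in insertion order
        if start in component_of:
            continue
        # Breadth-first search from the first unvisited node
        component_of[start] = n_components
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component_of:
                    component_of[neighbour] = n_components
                    queue.append(neighbour)
        n_components += 1

    # Both endpoints of an edge share a component, so u is enough
    groups = [[] for _ in range(n_components)]
    for u, v in edges:
        groups[component_of[u]].append([u, v])
    return groups

edges = [['a', 'b'], ['a', 'c'], ['b', 'c'], ['c', 'd'],
         ['e', 'f'], ['f', 'g'], ['x', 'y']]
print(group_edges_in_order(edges))
```

With the edges from the question, the groups should come out in the same order as the desired result: the a/b/c/d group first, then e/f/g, then x/y.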

JBN · Accepted Answer · 2020-11-07T04:31:37.910

Thanks for the suggestions, everyone, I guess I just needed the right vocabulary! Since under the linked answers, a couple of people asked for code to implement all of this, I thought I'd post an answer for future reference. Apparently, the concept of strongly connected components is not defined for non-directed graphs, so the solution is to look for connected components.

For my answer, I adjusted the code found here: https://www.geeksforgeeks.org/connected-components-in-an-undirected-graph/

It just requires reformulating l as integers, rather than strings:

class Graph:
    # init function to declare class variables
    def __init__(self, V):
        self.V = V
        self.adj = [[] for i in range(V)]
 
    def DFSUtil(self, temp, v, visited):
 
        # Mark the current vertex as visited
        visited[v] = True
 
        # Store the vertex to list
        temp.append(v)
 
        # Repeat for all vertices adjacent
        # to this vertex v
        for i in self.adj[v]:
            if not visited[i]:
 
                # Update the list
                temp = self.DFSUtil(temp, i, visited)
        return temp
 
    # method to add an undirected edge
    def addEdge(self, v, w):
        self.adj[v].append(w)
        self.adj[w].append(v)
 
    # Method to retrieve connected components
    # in an undirected graph
    def connectedComponents(self):
        visited = [False] * self.V
        cc = []
        for v in range(self.V):
            if not visited[v]:
                temp = []
                cc.append(self.DFSUtil(temp, v, visited))
        return cc
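
The reformulation of string labels as integers mentioned above can be sketched as follows. This helper is not part of the original answer: `encode_labels` is a hypothetical name, and it assumes every sublist is a pair. Labels receive consecutive ids in order of first appearance, which keeps the ids contiguous from 0 as the Graph class expects.

```python
def encode_labels(pairs):
    """Map arbitrary hashable labels to consecutive integer ids."""
    ids = {}       # label -> integer id, assigned on first appearance
    encoded = []
    for u, v in pairs:
        u_id = ids.setdefault(u, len(ids))
        v_id = ids.setdefault(v, len(ids))
        encoded.append([u_id, v_id])
    labels = list(ids)  # position i holds the label whose id is i
    return encoded, labels

string_pairs = [['a', 'b'], ['a', 'c'], ['b', 'c'], ['c', 'd'],
                ['e', 'f'], ['f', 'g'], ['x', 'y']]
encoded, labels = encode_labels(string_pairs)
print(encoded)
print(labels)
```

The returned `labels` list allows translating any component of integer ids back to the original strings, for example with `[labels[i] for i in component]`.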

Now we can run

l = [[0, 1], 
 [0, 2], 
 [1, 2],
 [2, 3],  
 [4, 5], 
 [5, 6], 
 [7, 8]]


g = Graph(
    max([item for sublist in l for item in sublist])+1
)

for sl in l:
    g.addEdge(sl[0], sl[1])
cc = g.connectedComponents()
print("Following are connected components")
print(cc)

And we get:

Following are connected components
[[0, 1, 2, 3], [4, 5, 6], [7, 8]]

We can then go back and group the original list:

result = []
for sublist in cc:
    bucket = [x for x in l if any(y in x for y in sublist)]
    result.append(bucket)

Output:

[[[0, 1], [0, 2], [1, 2], [2, 3]], [[4, 5], [5, 6]], [[7, 8]]]
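
Since l is very large, two details may matter: the recursive DFS can exceed Python's default recursion limit on long chains, and the grouping loop rescans l once per component. The sketch below is one possible way around both, not the code from the linked article. The names `connected_components_iterative` and `group_pairs` are made up, and it assumes the vertex ids are exactly 0 to num_vertices - 1, each appearing in at least one pair.

```python
def connected_components_iterative(num_vertices, pairs):
    """Connected components via DFS with an explicit stack (no recursion)."""
    adj = [[] for _ in range(num_vertices)]
    for v, w in pairs:
        adj[v].append(w)
        adj[w].append(v)

    visited = [False] * num_vertices
    components = []
    for start in range(num_vertices):
        if visited[start]:
            continue
        visited[start] = True
        stack = [start]
        component = []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if not visited[w]:
                    visited[w] = True   # mark on push so no vertex is pushed twice
                    stack.append(w)
        components.append(component)
    return components

def group_pairs(pairs, components):
    """Assign each pair to its component in a single pass over the pairs."""
    component_of = {v: i for i, comp in enumerate(components) for v in comp}
    buckets = [[] for _ in components]
    for pair in pairs:
        buckets[component_of[pair[0]]].append(pair)
    return buckets

l = [[0, 1], [0, 2], [1, 2], [2, 3], [4, 5], [5, 6], [7, 8]]
cc = connected_components_iterative(9, l)
print(group_pairs(l, cc))
```

The vertex order inside each component can differ from the recursive version, but the grouping of the pairs should be the same as the output above.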