16

Let G be a graph: a set of nodes and a set of links. I need to find a fast way to partition the graph. The graph I am working on now has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine, but website development), with millions of nodes.

So what I did was store all the links in an adjacency matrix:

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))

Now M holds a 1 in position s,t if node s is connected to node t. I make sure M is symmetric (M[s,t]=M[t,s]) and that each node links to itself (M[s,s]=1).

If I remember correctly, multiplying M by itself gives a matrix representing the graph that connects all vertices reachable within two steps.

So I keep multiplying M by itself until the number of zeros in the matrix no longer decreases. At that point the matrix encodes the connected components, and now I need to extract them from it.

Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. It is this next part that I am having trouble with.

Essentially I need to split this graph into its connected components.

I can go through all the nodes and see what they are connected to.
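Something like this sketch (untested) is what I have in mind, working from the rows of the fully multiplied matrix M described above; two nodes end up in the same component exactly when their rows have the same nonzero pattern:

def components_from_closure(M):
    # map each distinct nonzero row pattern to a component id
    seen = {}     # nonzero pattern -> component id
    labels = {}   # node index -> component id
    for s in range(M.shape[0]):
        key = tuple(numpy.nonzero(M[s])[1].tolist())  # nodes reachable from s
        labels[s] = seen.setdefault(key, len(seen))
    return labels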

Or what about sorting the matrix, reordering the rows (and the matching columns) so that each component forms a contiguous block? I don't know whether that is possible.

What follows is the code so far:

import numpy
import scipy

def findzeros(M):
    # count the zero entries of M
    # (a vectorised equivalent would be (M == 0).sum())
    nZeros=0
    for t in M.flat:
        if not t:
            nZeros+=1
    return nZeros

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))
for s in data.keys():
    M[s,s]=1
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threshold:
                M[s,t]=1
                M[t,s]=1

nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)

# keep squaring until the zero count stabilises
while (nZeros-nZeros2):
    nZeros=nZeros2
    M=M2
    M2=M*M
    nZeros2=findzeros(M2)
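Since only the zero/nonzero pattern matters, a small tweak I am considering (untested) is to clamp the entries back to 0/1 after each squaring, so the values, which otherwise count walks and grow exponentially, stay bounded:

# inside the loop, instead of M2=M*M:
M2 = (M * M != 0).astype(float)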

Edit:

It has been suggested that I use SVD. Here is a simple example of the problem on a 5x5 graph; we shall use this small case since the clusters are not easy to see in the full 19200x19200 matrix.

import numpy
import scipy

M=numpy.mat(numpy.zeros((5,5)))

M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1

print M

u,s,vh = numpy.linalg.svd(M)
print u
print s
print vh

Essentially there are 4 clusters here: (0), (1,3), (2), (4). But I still don't see how the SVD can help in this context.

Pietro Speroni
  • Could you clarify your question? I found an "is it possible" (which is always answered Yes, so that can't be your real question) and a "I don't see how SVD can help" which isn't a real question. What's your question? – S.Lott Mar 17 '09 at 10:55
  • Hello, thanks for spending the time on my question. The question, made explicit, is: "How should I determine the connected components?" I thought you understood it, and were just having some innocent fun. – Pietro Speroni Mar 17 '09 at 11:09
  • @Pietro Speroni: Consider rewriting your question to make it simpler, more focused and more clear. A long discussion is hard to follow. Short code examples and a very obvious question is better. You provide some code, so asking "how should I determine..?" doesn't seem right. – S.Lott Mar 17 '09 at 16:50
  • Thank you, but since I did receive the answer I was looking for, and since other users seemed to understand the question well enough, I think I shall stick with this. Regards, Pietro – Pietro Speroni Mar 18 '09 at 10:30

8 Answers

14

Why not use a real graph library, like Python-Graph? It has a function to determine connected components (though no example is provided). I'd imagine a dedicated library is going to be faster than whatever ad-hoc graph code you've cooked up.

EDIT: NetworkX seems like it might be a better choice than python-graph; its documentation (here for the connected components function) certainly is.
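For example, the whole job with NetworkX is roughly this (a sketch; depending on the version, connected_components returns lists or sets):

import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 3)])     # add your thresholded links here
G.add_nodes_from([0, 2, 4])    # isolated nodes are their own components

print(list(nx.connected_components(G)))   # e.g. [{1, 3}, {0}, {2}, {4}]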

kquinn
7

In SciPy you can use sparse matrices. Also note that there are more efficient ways of multiplying a matrix by itself. Anyway, what you're trying to do can be done by SVD decomposition.

Introduction with useful links.
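For instance, a minimal sketch with scipy.sparse (the exact classes available depend on your SciPy version):

from scipy import sparse

# only nonzero entries are stored, so even 19200x19200 (or much
# larger) fits comfortably in memory
M = sparse.lil_matrix((19200, 19200))   # lil_matrix is fast to fill incrementally
M[0, 0] = 1
M[1, 3] = 1
M[3, 1] = 1

M = M.tocsr()   # csr is fast for arithmetic
M2 = M * M      # sparse matrix-matrix product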

vartec
  • Thank you. I looked up the resource, but I honestly do not see how it can help. I updated the question with a simple example showing how SVD does not seem to solve it. Or maybe I am using it wrongly? But how then? Thanks in any case :) – Pietro Speroni Mar 17 '09 at 10:16
  • That's SVD (Singular Value Decomposition). Basically, for something as large as millions of nodes, you'll need an approximation algorithm rather than an exact one (graph clustering is NP-complete). The article has links to papers explaining such algorithms. – vartec Mar 17 '09 at 11:04
  • BTW, are you trying to reinvent PageRank or HITS? – vartec Mar 17 '09 at 11:06
  • Not really. Right now I'm just sorting which data belong to which biological cell. In future I have an equivalent problem that will eventually generate a search engine. But not on pages. And not using links. (Can't say more at this stage :) ). In any case, congratulations! Well spotted, LOL. – Pietro Speroni Mar 17 '09 at 11:57
  • Latent Semantic Analysis then? ;-) Ok, I'm not going to pull your tongue. Just keep in mind that what is possible at small scale gets really complicated when it's big. Most graph algorithms have high polynomial complexity, so it's infeasible to use them on a million nodes. – vartec Mar 17 '09 at 12:01
4

There are also graph_tool and networkit, which have efficient routines for connected components, and both store the network efficiently. If you're going to work with millions of nodes, networkx will likely not be sufficient (it's pure Python, AFAIK). Both of those tools are written in C++, so they can handle analysis of large graphs with reasonable run times.

As Phil points out, your method will have horribly long compute times for large graphs (we're talking days, weeks, months...), and a dense matrix representation of a graph with a million nodes needs about 10¹² entries, i.e. terabytes of memory!
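A rough sketch with graph-tool, for instance (assuming a recent version is installed; label_components returns a per-vertex component label plus a size histogram):

from graph_tool import Graph
from graph_tool.topology import label_components

g = Graph(directed=False)
g.add_edge_list([(1, 3)])   # vertices are created as needed
comp, hist = label_components(g)
print(comp.a)               # component label for each vertex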

drevicko
3

Finding an optimal graph partition is an NP-hard problem, so whatever the algorithm, it is going to be an approximation or a heuristic. Not surprisingly, different clustering algorithms produce (wildly) different results.

Python implementation of Newman's modularity algorithm: modularity

Also: MCL, MCODE, CFinder, NeMo, clusterONE

lynxoid
2

Here's a naive implementation I wrote some time ago that finds the connected components using depth-first search. Although it's very simple, it scales well to tens of thousands of vertices and edges...


import sys
from operator import gt, lt

class Graph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}
        self.cluster_lookup = {}
        self.no_link = {}

    def add_edge(self, n1, n2, w):
        self.nodes.add(n1)
        self.nodes.add(n2)
        self.edges.setdefault(n1, {}).update({n2: w})
        self.edges.setdefault(n2, {}).update({n1: w})

    def connected_components(self, threshold=0.9, op=lt):
        # pop an unvisited node, DFS from it, and wrap the reachable
        # nodes up as a subgraph; repeat until every node is placed
        nodes = set(self.nodes)
        components, visited = [], set()
        while len(nodes) > 0:
            connected, visited = self.dfs(nodes.pop(), visited, threshold, op)
            connected = set(connected)
            for node in connected:
                if node in nodes:
                    nodes.remove(node)

            subgraph = Graph()
            subgraph.nodes = connected
            subgraph.no_link = self.no_link
            for s in subgraph.nodes:
                for k, v in self.edges.get(s, {}).iteritems():
                    if k in subgraph.nodes:
                        subgraph.edges.setdefault(s, {}).update({k: v})
                if s in self.cluster_lookup:
                    subgraph.cluster_lookup[s] = self.cluster_lookup[s]

            components.append(subgraph)
        return components

    def dfs(self, v, visited, threshold, op=lt, first=None):
        # recursive DFS over edges whose weight passes op(w, threshold);
        # very deep graphs may hit Python's recursion limit
        aux = [v]
        visited.add(v)
        if first is None:
            first = v
        for i in (n for n, w in self.edges.get(v, {}).iteritems()
                  if op(w, threshold) and n not in visited):
            x, y = self.dfs(i, visited, threshold, op, first)
            aux.extend(x)
            visited = visited.union(y)
        return aux, visited

def main(args):
    graph = Graph()
    # first component
    graph.add_edge(0, 1, 1.0)
    graph.add_edge(1, 2, 1.0)
    graph.add_edge(2, 0, 1.0)

    # second component
    graph.add_edge(3, 4, 1.0)
    graph.add_edge(4, 5, 1.0)
    graph.add_edge(5, 3, 1.0)

    first, second = graph.connected_components(op=gt)
    print first.nodes
    print second.nodes

if __name__ == '__main__':
    main(sys.argv)
2

As others have pointed out, there's no need to reinvent the wheel. A lot of thought has been put into optimal clustering techniques. Here is one well-known clustering program.

RexE
0

Looks like there is a library PyMetis, which will partition your graph for you, given a list of links. It should be fairly easy to extract the list of links from your graph by passing it your original list of linked nodes (not the matrix-multiply-derived one).
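For reference, a hedged sketch of what calling it might look like (assuming the pymetis module; part_graph takes the number of parts and an adjacency list, and returns an edge-cut count plus a membership list):

import pymetis

# neighbours of nodes 0..4: a triangle (0,1,2) and a pair (3,4)
adjacency = [[1, 2], [0, 2], [0, 1], [4], [3]]
cuts, membership = pymetis.part_graph(2, adjacency=adjacency)
print(membership)   # e.g. [0, 0, 0, 1, 1]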

Repeatedly performing M' = MM will not be efficient for large orders of M. A full matrix multiply for matrices of order N costs N multiplications and N-1 additions per element, and there are N² elements, so that is O(N³) operations. If you scale that to "millions of nodes", that would be O(10¹⁸) operations per matrix-matrix multiplication, of which you want to do several.

In short, you don't want to do it this way. The SVD suggestion from Vartec would be the only appropriate choice there. Your best option is just to use PyMetis, and not try to reinvent graph-partitioning.

Phil H
  • Thanks. I admit the SVD suggestion totally went over my head. I am aware that graph partitioning is a well studied problem, so I was hoping to get some good insights when I posted here. But I also wanted to write what I knew, to show my good will :-) – Pietro Speroni Mar 17 '09 at 12:01
  • I think the key is to decide whether you want to learn about partitioning enough to rewrite the software on it (probably not), or whether you just want to partition a graph. If you decide just to use existing solutions, pick a library and use it. Seek to solve it at the highest level. – Phil H Mar 17 '09 at 12:31
  • I tried to install PyMetis, but it seems to be hard to install; there seems to be no configuration file. Looking for the easiest way out, I shall instead install networkx. Thanks, Pietro – Pietro Speroni Apr 03 '09 at 14:41
-1

The SVD algorithm is not applicable here, but otherwise Phil H is correct.