Efficient storage and use of large sets of graphs in Python w/ networkX, the aim being inference across the graphs?

Question

The goal is as follows: I have the set of nodes, size N, for which I would like to

generate (as an iterable) a set of all possible directed graphs (presumably using networkx.DiGraph() and numpy.arange(0,N))(edit: to clarify — this is a very large number of graphs for even small N, so some kind of iterable structure is needed which allows the set to be constructed on the fly)
filter this set of graphs based on their meeting some criteria (e.g., the existence of a complete path from node 0 to node N-1)
store parameter values associated with each edge
compute functions across the filtered sets of graphs, taking into account the edge-associated parameter values in those functions without needing to have all the graphs in memory at the same time (as that would be really taxing without parallelization, and I don't know how to handle parallelization yet)

It seems like there has been plenty of discussion of dealing efficiently with graphs with many nodes and edges (i.e., large graphs) but very little discussion of how to effectively handle many graphs at once (large sets of graphs) where each graph can be expected to not overtax the system.

Edit: Ideally this would also work well with pandas and numpy arrays, but I'm pretty sure that any solution to the above can also be made to work with these since fundamentally networkX is working over dictionaries of dictionaries.

jme · Accepted Answer · 2015-01-13T17:40:37.980

You write that you'd like to generate all possible directed graphs with N nodes. Unfortunately, this isn't feasible even for small N.

Given a set of N nodes, the number of possible undirected edges is N(N-1)/2. We can define a graph by selecting a subset of these edges. There are 2^(N*(N-1)/2) possible subsets, meaning that there are exactly that many possible undirected graphs on N nodes.

Suppose N=10. There are roughly 3.5 * 10^13 possible graphs on these nodes. If you could handle one million graphs per second, you'd need roughly 10^7 seconds handle all of the graphs. This is on the order of a year.

And that's just undirected graphs. There are more directed graphs. For N nodes, there are 2^(N*(N-1)) directed graphs. Here's a table that demonstrates how fast this blows up. |V| is the number of nodes:

|V|   Number of Digraphs
===   ==================
1     1
2     4
3     64
4     4096
5     1048576
6     1073741824
7     4398046511104
8     72057594037927936
9     4722366482869645213696

If you'd really like to do this, here are python generators which will lazily enumerate graphs on a node set:

from itertools import chain
import networkx as nx

def power_set(iterable):
    """Return an iterator over the power set of the finite iterable."""
    s = list(iterable)
    return chain.from_iterable(combinations(s, n) for n in xrange(len(s) + 1))

def enumerate_graphs(nodes):
    # create a list of possible edges
    possible_edges = list(combinations(nodes, 2))

    # create a graph for each possible set of edges
    for edge_set in power_set(possible_edges):
        g = nx.Graph()
        g.add_nodes_from(nodes)
        g.add_edges_from(edge_set)
        yield g

def enumerate_digraphs(nodes):
    # create a list of possible edges
    possible_edges = list(combinations(nodes, 2))

    # generate each edge set
    for edge_set in power_set(possible_edges):
        # for each set of `M` edges there are `M^2` directed graphs
        for swaps in power_set(xrange(len(edge_set))):
            directed_edge_set = list(edge_set)
            for swap in swaps:
                u,v = directed_edge_set[swap]
                directed_edge_set[swap] = v,u
            g = nx.DiGraph()
            g.add_nodes_from(nodes)
            g.add_edges_from(directed_edge_set)
            yield g

We can then plot all of the directed graphs thusly:

nodes = ("apples", "pears", "oranges")

digraphs = list(enumerate_digraphs(nodes))
layout = nx.random_layout(digraphs[0])

plt.figure(figsize=(20,30))
for i,g in enumerate(digraphs):
    plt.subplot(6,5,i+1)
    nx.draw_networkx(g, pos=layout)
    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)

Enumeration of digraphs

I know — that's why I posed the question. I'm worried about how to avoid this problem of memory even for small N. It'd be easier if one could apply the filtering function and create an iterable object such that at no point all of the graphs need to be stored in memory, but that's exactly where I am stuck. — mpacer, Jan 09 '15 at 22:40
@mpacer A graph generator can be written -- but perhaps it shouldn't. How large of `N` are you considering? Generating all graphs with more than 7 nodes just isn't feasible, whether it occurs lazily or not. That's a sign that you might want to adopt a different approach, maybe even an approximation if possible. — jme, Jan 09 '15 at 22:45
For the particular problem — I'm actually interested in fairly small N (4), but I wanted to pose the question more generally so as to get a flavor of how people would handle the more intense version of the problem. With an idea of how to build one graph generator, I was hoping to solve other problems in the future where I don't need all possible graphs, but nonetheless a large set that can be generated algorithmically. — mpacer, Jan 09 '15 at 22:49
@mpacer Fair enough. I'll write up a graph generator and amend my answer in a moment. — jme, Jan 09 '15 at 22:50

Efficient storage and use of large sets of graphs in Python w/ networkX, the aim being inference across the graphs?

1 Answers1