Out of core connected components algorithms

Question

I have 4,000,000,000 (four billion) edges for an undirected graph. They are represented in a large text file as pairs of node ids. I would like to compute the connected components of this graph. Unfortunately, once you load in the node ids with the edges into memory this takes more than the 128GB of RAM I have available.

Is there an out-of-core algorithm for finding connected components that is relatively simple to implement? Or even better, can it be cobbled together with Unix command-line tools and existing (Python) libraries?

I suppose you need simple connected components and not 'strongly' since the graph is undirected? — dreamzor, Jul 29 '16 at 13:49
Actually, it should be possible to store 4 billion edges to 128 GB of RAM. How do you store it? Not as an adjacency matrix I hope? Then, again, you need to do a simple BFS/DFS which also needs O(N) memory. — dreamzor, Jul 29 '16 at 13:52
@dreamzor Sadly it doesn't quite fit as a graph (not an adjacency matrix). I did try in igraph and networkx. And yes you need more memory to actually run the algorithm! — Simd, Jul 29 '16 at 13:54
what if they are all in one big connected component? Then that wouldn't fit in memory either. — RBarryYoung, Jul 29 '16 at 13:56
@RBarryYoung Well the idea is that you would output the connected component without actually storing it all in RAM. — Simd, Jul 29 '16 at 13:56
don't use python to read such a file... C++ is also your friend — FrankS101, Jul 29 '16 at 13:58
I also think that 4 billion edges should be ok on 128GB of ram. What about an adjacency-matrix stored as sparse-matrix? — sascha, Jul 29 '16 at 13:58
Well, it can be done, but it will be glacially slow. Pretty sure it'd be faster to buy then add more memory. — RBarryYoung, Jul 29 '16 at 14:00
One option is to just expand your virtual memory/pagefile and then use your existing algorithm, and let it page like heck. — RBarryYoung, Jul 29 '16 at 14:09
How are your data files structured? Do you just have the edge list? — templatetypedef, Jul 29 '16 at 14:35
@templatetypedef It's just a plain text file where each line has two node ids saying those two nodes are connected. — Simd, Jul 29 '16 at 16:46
Do you have enough memory to hold all the names of the nodes in RAM, even if you can't store edges? — templatetypedef, Jul 29 '16 at 16:46
@templatetypedef Yes I do. In Python it takes 48GB it seems. — Simd, Jul 29 '16 at 17:03

score 4 · Answer 1 · answered Jul 29 '16 at 17:28

Based on the description of the problem you've provided and the answers you provided in the comments, I think the easiest way to do this might be to use an approach like the one @dreamzor described. Here's a more fleshed-out version of that answer.

The basic idea is to convert the data to a more compressed format that fits into memory, run a regular connected components algorithm on that data, and then decompress the result. Notice that if you assign each node a 32-bit numeric ID, then the total space required to store the graph is at most the space for four billion nodes and eight billion edge entries (assuming you store two copies of each edge). That is twelve billion 32-bit integers, or about 48GB, which is below your memory threshold.
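The estimate above can be checked with a few lines of arithmetic. This is only a sketch of the counting used in this answer; the function name `estimate_bytes` and its parameters are made up for illustration, and real implementations add some overhead (for example, wider offsets or per-node bookkeeping).

```python
def estimate_bytes(num_nodes, num_edges, bytes_per_id=4, copies_per_edge=2):
    """Rough graph size: one ID-sized slot per node plus one per stored edge copy."""
    return (num_nodes + copies_per_edge * num_edges) * bytes_per_id


total = estimate_bytes(4_000_000_000, 4_000_000_000)
print(total / 1e9)  # 48.0 (decimal gigabytes)
```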

To start off, write a script that reads in the edges file and assigns a numeric ID to each node (perhaps sequentially, in the order in which they appear). Have this script write the mapping to a file and, as it goes, write a new edges file that uses the numeric IDs of the nodes rather than the string names. When you're done, you'll have a names file mapping IDs to names and an edges file that takes up much less space than before. You mentioned in the comments that you can fit all the node names into memory, so this step should be very reasonable. Note that you don't need to store all the edges in memory (you can stream them through the program), so that shouldn't be a bottleneck.
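A minimal sketch of that relabeling pass might look like the following. The function name `relabel_edges` and the file layouts are assumptions: the input is taken to be one whitespace-separated pair of names per line, and the names file stores one name per line so that the zero-based line number is the ID. Only the name-to-ID dictionary is held in memory; the edges are streamed.

```python
import io


def relabel_edges(edge_lines, edges_out, names_out):
    """Stream edges, giving each node name a sequential integer ID on first sight."""
    ids = {}
    for line in edge_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pair = []
        for name in parts:
            node_id = ids.get(name)
            if node_id is None:
                node_id = len(ids)
                ids[name] = node_id
                names_out.write(name + "\n")  # line number in this file == ID
            pair.append(node_id)
        edges_out.write(f"{pair[0]} {pair[1]}\n")
    return len(ids)


# Tiny in-memory demonstration; real use would pass open file objects.
src = io.StringIO("a b\nb c\nx y\n")
edges_file, names_file = io.StringIO(), io.StringIO()
print(relabel_edges(src, edges_file, names_file))  # 5 distinct nodes
```

With real data, the three arguments would be files opened with `open(...)`; the logic does not change.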

Next, write a program that reads the edges file - but not the names file - into memory and finds connected components using any reasonable algorithm (BFS or DFS would be great here). If you're careful with your memory (using something like C or C++ here would be a good call), this should fit comfortably into main memory. When you're done, write out all the clusters to an external file by numeric ID. You now have a list of all the CCs by ID.
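To make the memory layout concrete, here is a small Python sketch of BFS over a compact adjacency-array representation, with each undirected edge stored twice as in the estimate earlier in this answer. The names `build_csr` and `label_components` are hypothetical, and the edges are passed as an in-memory list only for brevity; on the real data you would make the two passes by reading the numeric edges file twice, and a C or C++ version of the same loops would be far faster.

```python
from array import array
from collections import deque

UNSEEN = 0xFFFFFFFF  # sentinel label; assumes fewer than 2**32 - 1 components


def build_csr(num_nodes, edges):
    """Build offset/neighbor arrays; every undirected edge is stored in both directions."""
    offsets = array("Q", [0]) * (num_nodes + 1)
    for a, b in edges:  # first pass: count the degree of each node
        offsets[a + 1] += 1
        offsets[b + 1] += 1
    for i in range(num_nodes):  # prefix sums turn counts into start offsets
        offsets[i + 1] += offsets[i]
    neighbors = array("I", [0]) * offsets[num_nodes]
    cursor = offsets[:num_nodes]  # copy: next free slot for each node
    for a, b in edges:  # second pass: fill in both directions
        neighbors[cursor[a]] = b
        cursor[a] += 1
        neighbors[cursor[b]] = a
        cursor[b] += 1
    return offsets, neighbors


def label_components(offsets, neighbors):
    """Breadth-first search from every unlabeled node; returns one label per node."""
    num_nodes = len(offsets) - 1
    label = array("I", [UNSEEN]) * num_nodes
    next_label = 0
    for start in range(num_nodes):
        if label[start] != UNSEEN:
            continue
        label[start] = next_label
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for k in range(offsets[node], offsets[node + 1]):
                other = neighbors[k]
                if label[other] == UNSEEN:
                    label[other] = next_label
                    queue.append(other)
        next_label += 1
    return label


offsets, neighbors = build_csr(6, [(0, 1), (1, 2), (3, 4)])
print(list(label_components(offsets, neighbors)))  # [0, 0, 0, 1, 1, 2]
```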

Finally, write a program that reads in the ID to node mapping from the names file, then streams in the cluster IDs and writes out the names of all the nodes in each cluster to a final file.
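A sketch of this last step, under assumed file formats: the names file has one name per line (zero-based line number equals ID), and the cluster file has one `node_id component_id` pair per line. The function name `write_named_components` is made up for illustration. Only the names are held in memory; the labels are streamed.

```python
import io


def write_named_components(names_in, labels_in, out):
    """Translate 'node_id component_id' lines into 'component_id<TAB>name' lines."""
    names = [line.rstrip("\n") for line in names_in]  # index == numeric ID
    for line in labels_in:
        node_id, comp_id = line.split()
        out.write(f"{comp_id}\t{names[int(node_id)]}\n")


names_file = io.StringIO("alice\nbob\ncarol\n")
labels_file = io.StringIO("0 0\n1 0\n2 1\n")
result = io.StringIO()
write_named_components(names_file, labels_file, result)
print(result.getvalue(), end="")
```

If the output needs to be grouped by component, sorting the resulting file on its first column is one option.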

This approach should be relatively straightforward to implement because the key idea is to keep the existing algorithms you're used to and just change the representation of the graph to be more memory efficient. I've used approaches like this in the past when dealing with huge graphs (Wikipedia), and it's worked beautifully even on systems with less memory than yours.

Vesper · Answer 2 · 2016-07-29T14:13:16.730

You can hold only an array of vertex "colors" (one int value per vertex), then run through the file without loading the entire set of links, coloring vertices as you go: assign a new color if neither vertex is colored; copy the existing color if one vertex is colored and the other isn't; and if both are colored, keep the lower of the two colors and repaint every vertex in the array that has the higher color. A pseudocode example:

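int nextColor = 1;   // 0 means "not yet colored"
int merges = 0;
int[] vertices;      // one color per vertex, zero-initialized
while (!file.eof()) {
    link = file.readLink();
    c1 = vertices[link.a];
    c2 = vertices[link.b];
    if ((c1 == 0) && (c2 == 0)) {
        // neither endpoint is colored: start a new component
        vertices[link.a] = nextColor;
        vertices[link.b] = nextColor;
        nextColor++;
    } else if ((c1 != 0) && (c2 != 0)) {
        if (c1 != c2) {
            // both colored differently: repaint the higher color with the lower one
            lo = min(c1, c2);
            hi = max(c1, c2);
            for (i = vertices.length - 1; i >= 0; i--)
                if (vertices[i] == hi) vertices[i] = lo;
            merges++;
        }
    } else if (c1 == 0) vertices[link.a] = c2; // only c1 is 0
    else vertices[link.b] = c1;                // only c2 is 0
}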

If you choose a type smaller than 32 bits for storing the color of a vertex, you might need to first check whether nextColor has reached its maximum, keep an array of unused colors (released by merges), and skip coloring a new pair of vertices if no color is available. If the colors ran out and any merges occurred, re-run the file reading process.
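One cost worth noting in the pseudocode above is that every merge scans the whole vertex array. A common substitute, which is not what this answer describes but keeps the same one-integer-per-vertex footprint, is a disjoint-set (union-find) forest: each vertex stores a parent pointer instead of a color, and a merge just redirects one root. The sketch below assumes vertices are already numbered 0 to num_nodes - 1; the function names are made up for illustration.

```python
from array import array


def find(parent, x):
    """Return the root of x's tree, halving the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def components_union_find(num_nodes, edges):
    """Stream edges once; return the representative (lowest ID) for every vertex."""
    parent = array("I", range(num_nodes))  # every vertex starts as its own root
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            # Keep the lower root, echoing "lowest of two colors" above.
            if ra < rb:
                parent[rb] = ra
            else:
                parent[ra] = rb
    return [find(parent, v) for v in range(num_nodes)]


print(components_union_find(6, [(0, 1), (1, 2), (3, 4), (4, 3)]))
# [0, 0, 0, 3, 3, 5]
```

Here `edges` can be any iterable, so the pairs can be parsed lazily from the file rather than loaded up front.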

UPDATE: Since the vertices aren't really ints but strings, you should also keep a map from string to int while parsing that file. If your strings are limited in length, you can probably fit them all into memory as a hash table, but I'd pre-process the file by creating another file that has all strings "s1" replaced with "1", "s2" with "2", etc., where "s1", "s2" are whatever names appear as vertices in the file, so that the data is compacted to a list of pairs of ints. In case you'll be processing similar data later (that is, your graph isn't changing much and contains largely the same vertex names), store the "metadata" file with the mapping from names to ints to ease further pre-processing.
