7

I have 4,000,000,000 (four billion) edges for an undirected graph. They are represented in a large text file as pairs of node ids. I would like to compute the connected components of this graph. Unfortunately, once you load in the node ids with the edges into memory this takes more than the 128GB of RAM I have available.

Is there an out-of-core algorithm for finding connected components that is relatively simple to implement? Or even better, can it be cobbled together with Unix command tools and existing (python) libraries?

Philipp Claßen
Simd

2 Answers

4

Based on the description of the problem you've provided and the answers you provided in the comments, I think the easiest way to do this might be to use an approach like the one @dreamzor described. Here's a more fleshed-out version of that answer.

The basic idea is to convert the data to a more compressed format that fits into memory, to run a regular connected components algorithm on that data, then to decompress it. Notice that if you assign each node a 32-bit numeric ID, then the total space required to store the graph is at most the space for four billion nodes and eight billion edge entries (assuming you store two copies of each edge). That is space for twelve billion 32-bit integers, only around 48GB, which is below your memory threshold.
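
To make the arithmetic concrete, here is a small sanity check of that estimate. The function name `graph_bytes` and its parameters are made up for illustration; the only assumption is one 32-bit slot per node plus one per stored edge copy.

```python
BYTES_PER_ID = 4  # a 32-bit node ID


def graph_bytes(num_nodes: int, num_edges: int, copies_per_edge: int = 2) -> int:
    """Upper bound: one 32-bit slot per node plus one per stored edge copy."""
    slots = num_nodes + copies_per_edge * num_edges
    return slots * BYTES_PER_ID


# Worst case from the question: four billion edges, at most four billion nodes.
estimate = graph_bytes(num_nodes=4_000_000_000, num_edges=4_000_000_000)
print(f"{estimate / 10**9:.0f} GB")  # prints "48 GB"
```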

To start off, write a script that reads in the edges file, assigns a numeric ID to each node (perhaps sequentially in the order in which they appear). Have this script write this mapping to a file and, as it goes, write a new edges file that uses the numeric IDs of the nodes rather than the string names. When you're done, you'll have a names file mapping IDs to names and an edges file that takes up much less space than before. You mentioned in the comments that you can fit all the node names into memory, so this step should be very reasonable. Note that you don't need to store all the edges in memory - you can stream them through the program - so that shouldn't be a bottleneck.
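
A minimal sketch of this relabelling step might look like the following. It assumes each line of the edges file holds two whitespace-separated node names, and it writes the names file so that line number i (counting from 0) holds the name of ID i. The function name `relabel_edges` and both file layouts are my own choices, not something fixed by the question.

```python
import io
from typing import Dict, TextIO


def relabel_edges(edges_in: TextIO, edges_out: TextIO, names_out: TextIO) -> int:
    """Stream edges, replace node names with sequential IDs, return the node count."""
    ids: Dict[str, int] = {}  # only the names are held in memory, never the edges
    for line in edges_in:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pair = []
        for name in parts:
            node_id = ids.get(name)
            if node_id is None:
                node_id = len(ids)  # IDs follow order of first appearance
                ids[name] = node_id
                names_out.write(f"{name}\n")  # line i of the names file is ID i
            pair.append(node_id)
        edges_out.write(f"{pair[0]} {pair[1]}\n")
    return len(ids)


# Tiny in-memory demo; real use would pass files opened with open().
demo_edges, demo_names = io.StringIO(), io.StringIO()
node_count = relabel_edges(io.StringIO("alice bob\nbob carol\n"), demo_edges, demo_names)
print(node_count)
print(demo_edges.getvalue(), end="")
```

Because the edges are written out as they are read, memory use is governed by the name-to-ID dictionary alone.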

Next, write a program that reads the edges file - but not the names file - into memory and finds connected components using any reasonable algorithm (BFS or DFS would be great here). If you're careful with your memory (using something like C or C++ here would be a good call), this should fit comfortably into main memory. When you're done, write out all the clusters to an external file by numeric ID. You now have a list of all the CCs by ID.
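
Here is a rough sketch of that step in Python, using BFS over a compact adjacency layout built from typed arrays. It assumes node IDs run from 0 to `num_nodes - 1` and takes the edges as an in-memory list for brevity. At the real scale you would read the edges file twice instead (once to count, once to fill), and a C or C++ version of the same layout would be the safer bet. The name `connected_components` and the sentinel value are illustrative choices.

```python
from array import array
from collections import deque
from typing import List, Tuple

UNSEEN = 0xFFFFFFFF  # sentinel label; assumes array("I") items are 4 bytes wide


def connected_components(num_nodes: int, edges: List[Tuple[int, int]]) -> array:
    """Return an array where entry v is the component number of node v."""
    # Neighbours of v are stored in adj[start[v]:start[v + 1]].
    start = array("Q", [0]) * (num_nodes + 1)
    for a, b in edges:  # pass 1: count the degree of every node
        start[a + 1] += 1
        start[b + 1] += 1
    for v in range(num_nodes):  # prefix sums turn degrees into offsets
        start[v + 1] += start[v]

    adj = array("I", [0]) * (2 * len(edges))  # two copies of each edge
    fill = start[:num_nodes]  # next free slot for each node (a copy)
    for a, b in edges:  # pass 2: fill in the neighbour lists
        adj[fill[a]] = b
        fill[a] += 1
        adj[fill[b]] = a
        fill[b] += 1

    label = array("I", [UNSEEN]) * num_nodes
    component = 0
    for root in range(num_nodes):
        if label[root] != UNSEEN:
            continue
        label[root] = component
        queue = deque([root])
        while queue:  # plain BFS from the unlabelled root
            v = queue.popleft()
            for i in range(start[v], start[v + 1]):
                w = adj[i]
                if label[w] == UNSEEN:
                    label[w] = component
                    queue.append(w)
        component += 1
    return label


print(list(connected_components(5, [(0, 1), (1, 2), (3, 4)])))  # [0, 0, 0, 1, 1]
```

The offsets are 64-bit because the neighbour array can hold up to eight billion entries, which no longer fits in a 32-bit index.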

Finally, write a program that reads in the ID to node mapping from the names file, then streams in the cluster IDs and writes out the names of all the nodes in each cluster to a final file.
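
As a sketch of this last step, assuming a names file where line i holds the name of ID i and a clusters file with one "node_id component_id" pair per line (both layouts are assumptions, as are the function names):

```python
import io
from typing import List, TextIO


def load_names(names_in: TextIO) -> List[str]:
    """Read the names file into a list so that names[i] is the name of ID i."""
    return [line.rstrip("\n") for line in names_in]


def write_named_components(names: List[str], clusters_in: TextIO, out: TextIO) -> None:
    """Stream 'node_id component_id' lines and emit 'name<TAB>component_id' lines."""
    for line in clusters_in:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        node_id, component = parts
        out.write(f"{names[int(node_id)]}\t{component}\n")


# Tiny in-memory demo; real use would pass files opened with open().
names = load_names(io.StringIO("alice\nbob\ncarol\n"))
result = io.StringIO()
write_named_components(names, io.StringIO("0 0\n1 0\n2 1\n"), result)
print(result.getvalue(), end="")
```

Only the names are held in memory here; the cluster assignments are streamed, mirroring the first step in reverse.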

This approach should be relatively straightforward to implement because the key idea is to keep the existing algorithms you're used to and just change the representation of the graph to be more memory efficient. I've used approaches like this in the past when dealing with huge graphs (Wikipedia), and it's worked beautifully even on systems with less memory than yours.

templatetypedef
1

You can hold just an array of vertex "colors" (one int per vertex), then run through the file without loading the entire set of links, coloring vertices as you go: assign a new color if neither vertex is colored; copy the existing color if one vertex is colored and the other isn't; and if both are colored, keep the lower of the two colors and repaint every vertex in the array that has the higher color. A pseudocode example:

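int nextColor = 1;
int merges = 0;
int[] vertices = new int[vertexCount]; // vertexCount assumed known; 0 means "not colored yet"
while (!file.eof()) {
    link = file.readLink();
    c1 = vertices[link.a];
    c2 = vertices[link.b];
    if ((c1 == 0) && (c2 == 0)) {
        // neither endpoint colored: start a new color
        vertices[link.a] = nextColor;
        vertices[link.b] = nextColor;
        nextColor++;
    } else if ((c1 != 0) && (c2 != 0)) {
        if (c1 != c2) {
            // both colored differently: repaint the higher color with the lower one
            lo = min(c1, c2);
            hi = max(c1, c2);
            for (i = vertices.length - 1; i >= 0; i--) if (vertices[i] == hi) vertices[i] = lo;
            merges++;
        } // same color already: nothing to do
    } else if (c1 == 0) vertices[link.a] = c2; // only c1 is 0
    else vertices[link.b] = c1; // only c2 is 0
}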
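
For anyone who wants something runnable, here is a sketch of the same coloring idea in Python. It assumes vertices are already numbered from 0 to `num_vertices - 1` and that the links arrive as an iterable of integer pairs. The name `color_components` is made up for this example.

```python
from typing import Iterable, List, Tuple


def color_components(num_vertices: int, links: Iterable[Tuple[int, int]]) -> List[int]:
    """Color vertices so that connected vertices share a color; 0 means never seen."""
    colors = [0] * num_vertices
    next_color = 1
    for a, b in links:  # links can be streamed, only the color array is stored
        c1, c2 = colors[a], colors[b]
        if c1 == 0 and c2 == 0:
            # Neither endpoint is colored: start a new color.
            colors[a] = colors[b] = next_color
            next_color += 1
        elif c1 == 0:
            colors[a] = c2
        elif c2 == 0:
            colors[b] = c1
        elif c1 != c2:
            # Both colored differently: repaint the higher color with the lower.
            lo, hi = min(c1, c2), max(c1, c2)
            for i in range(num_vertices):
                if colors[i] == hi:
                    colors[i] = lo
    return colors


print(color_components(6, [(0, 1), (2, 3), (1, 2), (4, 4)]))  # [1, 1, 1, 1, 3, 0]
```

Each merge scans the whole array, so this trades speed for memory: it stays cheap only when merges are rare compared to the number of links.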

If you choose a type smaller than 32 bits for storing the color of a vertex, you might need to first check whether nextColor has reached its maximum, keep an array of unused colors (released by merges), and skip coloring a new pair of vertices if no color is available. Then re-run the file reading process if the colors were all used up and any merges occurred.

UPDATE: Since the vertices aren't really ints but strings, you should also keep a map from string to int while parsing that file. If your strings are limited in length, you can probably fit them all into memory as a hash table, but I'd pre-process the file by creating another file that has all strings "s1" replaced with "1", "s2" with "2", etc., where "s1", "s2" are whatever names appear as vertices in the file, so that the data is compacted to a list of pairs of ints. If you'll be processing similar data later (that is, your graph isn't changing much and contains largely the same vertex names), store the "metadata" file with the mapping from names to ints to ease further pre-processing.

Vesper