Extract single linkage clusters from very large pairs list

Question

I have a very large pairs list that I need to break down into single linkage communities. So far I have been able to do this entirely in R just fine. But I need to prepare for the eventuality that the entire list may be too large to hold in memory, or for igraph's R implementation to handle. A very simple version of this task looks like:

library(igraph)
df <- data.frame("p1" = c("a", "a", "d", "d"),
                 "p2" = c("b", "c", "e", "f"),
                 "val" = c(0.5, 0.75, 0.25, 0.35))
g <- graph_from_data_frame(d = df,
                           directed = FALSE)

sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g,
                                          vids = x),
             USE.NAMES = FALSE,
             simplify = FALSE)

If df is incredibly large, on the scale of hundreds of millions to tens of billions of rows, is there a way for me to extract individual elements of sg without having to build g in its entirety? It's relatively easy for me to store representations of df outside of R, either as a compressed txt file or as a SQLite database.

The first thing I would try is compressing your data set: use integers to represent your nodes and ditch the edge weights as you don't seem to be using them to compute the linkage. If your network still does not fit into memory, then you have to do it out of memory but you won't be able to do it in R/igraph; example implementation [here](https://stackoverflow.com/a/18382582/2912349). — Paul Brodersen, Jan 25 '21 at 14:32
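The out-of-memory route suggested in the comment above can be sketched in R with a union-find (disjoint-set) structure: only one integer per node is held in memory, while the edges are streamed from disk in chunks. This is a minimal sketch under stated assumptions, not the linked implementation. `find_root` and `components_from_file` are hypothetical helper names, node labels are assumed to have been replaced by integer ids 1..n, and the edge file is assumed to hold two whitespace-separated integer columns with no header and no weights.

```r
# Sketch only: union-find over integer node ids 1..n_nodes.
# Only `parent` (one integer per node) stays in memory; edges are streamed.
find_root <- function(parent, i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Hypothetical helper, not an igraph function.
# Assumes a whitespace-separated file with two integer columns and no header.
components_from_file <- function(path, n_nodes, chunk_size = 1e6) {
  parent <- seq_len(n_nodes)          # every node starts as its own root
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break     # end of file
    edges <- read.table(text = lines, col.names = c("p1", "p2"))
    for (k in seq_len(nrow(edges))) {
      a <- find_root(parent, edges$p1[k])
      b <- find_root(parent, edges$p2[k])
      if (a != b) parent[max(a, b)] <- min(a, b)  # merge the two sets
    }
  }
  # Label every node with the root of its set.
  vapply(seq_len(n_nodes), function(i) find_root(parent, i), integer(1))
}

# Usage on the toy data from the question: encode labels as integers,
# drop the weights, write the edges to a temporary file, then stream them.
df <- data.frame(p1 = c("a", "a", "d", "d"),
                 p2 = c("b", "c", "e", "f"),
                 stringsAsFactors = FALSE)
labels <- unique(c(df$p1, df$p2))
edge_file <- tempfile(fileext = ".txt")
write.table(data.frame(match(df$p1, labels), match(df$p2, labels)),
            edge_file, row.names = FALSE, col.names = FALSE)

roots <- components_from_file(edge_file, n_nodes = length(labels),
                              chunk_size = 2)
clusters <- split(labels, roots)  # one character vector of labels per cluster
```

The per-edge loop favors clarity over speed; at the scale of billions of rows the same logic would likely need a compiled implementation, and the label-to-integer mapping would have to be built outside of R as well.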

pookpash · Accepted Answer · 2021-01-25T19:20:48.160

To address the problem with igraph's R implementation (assuming the dataset still fits in RAM; otherwise see @Paul Brodersen's comment):

The solution below starts from one element of the graph and repeatedly follows its connections until no further edges are found. It therefore extracts the subgraph containing that element without building the whole graph. It looks a bit hacky compared to a recursive function but scales better.

library(igraph)

# Return every row of df that belongs to the connected component
# containing `element`.
reduce_graph <- function(df, element) {
  elements_to_inspect <- element
  rows_graph <- 0
  repeat {
    # Keep every edge that touches a node found so far.
    graph_parts <- df[df$p1 %in% elements_to_inspect |
                        df$p2 %in% elements_to_inspect, ]
    # The endpoints of those edges are the nodes to expand in the next pass.
    elements_to_inspect <- unique(c(graph_parts$p1, graph_parts$p2))
    # Stop once a pass adds no new edges.
    if (nrow(graph_parts) == rows_graph) break
    rows_graph <- nrow(graph_parts)
  }
  graph_parts
}

df <- data.frame("p1" = c("a", "a", "d", "d","o"),
                 "p2" = c("b", "c", "e", "f","u"),
                 "val" = c(100, 0.75, 0.25, 0.35,1))

small_graph <- reduce_graph(df, "f")
g <- graph_from_data_frame(d = small_graph,
                           directed = FALSE)

sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g,
                                          vids = x),
             USE.NAMES = FALSE,
             simplify = FALSE)

One can test the speed on a bigger dataset.

## Larger dataset made up of many small, sparse components.
set.seed(100)
p1 <- as.character(sample(1:10000000, 1000000, replace = TRUE))
p2 <- as.character(sample(1:10000000, 1000000, replace = TRUE))
val <- rep(1, 1000000)
df <- data.frame("p1" = p1,
                 "p2" = p2,
                 "val" = val)

small_graph <- reduce_graph(df, "9420672") #has 3 pairwise connections
g <- graph_from_data_frame(d = small_graph,
                           directed = FALSE)

sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g,
                                          vids = x),
             USE.NAMES = FALSE,
             simplify = FALSE)

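Building the groups and subgraph takes one second, compared to multiple minutes for the whole graph on my machine. This of course depends on how sparsely connected the graph is: every pass of the loop scans all of df, so a component with long chains of connections needs more passes.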
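If every component is needed rather than just the one around a chosen element, the same function can be applied repeatedly, removing each extracted component from df before the next call. The following is a rough sketch of that idea, assuming df still fits in memory; `all_components` is a hypothetical helper name, and `reduce_graph` is repeated here only so the block runs on its own.

```r
# Same logic as reduce_graph above, repeated so this block is self-contained.
reduce_graph <- function(df, element) {
  elements_to_inspect <- element
  rows_graph <- 0
  repeat {
    graph_parts <- df[df$p1 %in% elements_to_inspect |
                        df$p2 %in% elements_to_inspect, ]
    elements_to_inspect <- unique(c(graph_parts$p1, graph_parts$p2))
    if (nrow(graph_parts) == rows_graph) break
    rows_graph <- nrow(graph_parts)
  }
  graph_parts
}

# Hypothetical helper: peel off one component at a time until df is empty.
all_components <- function(df) {
  out <- list()
  while (nrow(df) > 0) {
    part <- reduce_graph(df, df$p1[1])   # component around the first node
    out[[length(out) + 1]] <- part
    # Every edge of that component has its p1 among the component's nodes,
    # so filtering on p1 removes exactly the extracted rows.
    df <- df[!(df$p1 %in% c(part$p1, part$p2)), ]
  }
  out
}

df <- data.frame(p1 = c("a", "a", "d", "d", "o"),
                 p2 = c("b", "c", "e", "f", "u"),
                 val = c(100, 0.75, 0.25, 0.35, 1),
                 stringsAsFactors = FALSE)
parts <- all_components(df)   # list of edge data frames, one per component
```

Each element of `parts` can then be passed to graph_from_data_frame() on its own. With very many components this rescans df many times, so it is only practical when the number of components of interest is modest.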
