I have two data frames. First, a lookup table comprising a list of vertex names:
lookup <- data.frame(Name=c("Bob","Jane"))
Then I have an edge list that looks like this:
edges <- data.frame(vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob"
,"Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob"
,"Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"))
For each unique vertex in the "lookup" table, I'd like to iterate through the "edges" table and label every edgeID where lookup$Name is among the vertices.
I can do that with the following script:
library(igraph)
g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
do.call(
rbind,
c(
make.row.names = FALSE,
lapply(
as.character(lookup$Name),
function(nm) {
z <- c(nm, V(g)$name[distances(g, nm) == 1])
cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
}
)
)
)
group vertex1 edgeID vertex2
1 Bob Bob 1 Bill
2 Bob Bill 1 Bob
3 Bob Bob 1 Jane
4 Bob Jane 1 Bob
5 Bob Bill 1 Jane
6 Bob Bob 2 Jane
7 Bob Jane 2 Bob
8 Bob Jane 1 Bill
9 Jane Bob 1 Bill
10 Jane Bill 1 Bob
11 Jane Bob 1 Jane
12 Jane Jane 1 Bob
13 Jane Bill 1 Jane
14 Jane Jane 1 Jill
15 Jane Bob 2 Jane
16 Jane Jane 2 Bob
17 Jane Jane 1 Bill
18 Jane Jane 3 Jill
19 Jane Jill 3 Jane
20 Jane Jane 3 Susan
21 Jane Susan 3 Jane
22 Jane Susan 3 Jill
The problem is that this seems inefficient for large edge lists. In my real data, "lookup" has 3,263 observations while "edges" has 167,775,170 observations. I've attempted to run the script above on an Amazon EC2 instance with 16 cores and 100GB or RAM for two days now with no end in sight (using "future_lapply" instead of "lapply" to allow for parallel processing). Is there any way that I can make this more efficient/faster?
This won't be the only time I need to group edges like this and I'm hoping to find a way to do it that isn't so expensive in terms of time and Amazon bills.