I have two data frames. First, a lookup table comprising a list of vertex names:

lookup <- data.frame(Name=c("Bob","Jane"))

Then I have an edge list that looks like this:

edges <- data.frame(vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob"
                              ,"Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
                  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
                  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob"
                              ,"Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"))

For each unique vertex in the "lookup" table, I'd like to iterate through the "edges" table and label every edgeID where lookup$Name is among the vertices.

I can do that with the following script:

library(igraph)

# build an undirected graph; the columns are reordered so the two vertex
# columns come first and edgeID is stored as an edge attribute
g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      as.character(lookup$Name),
      function(nm) {
        # nm together with its direct neighbours (distance exactly 1)
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)
   group vertex1 edgeID vertex2
1    Bob     Bob      1    Bill
2    Bob    Bill      1     Bob
3    Bob     Bob      1    Jane
4    Bob    Jane      1     Bob
5    Bob    Bill      1    Jane
6    Bob     Bob      2    Jane
7    Bob    Jane      2     Bob
8    Bob    Jane      1    Bill
9   Jane     Bob      1    Bill
10  Jane    Bill      1     Bob
11  Jane     Bob      1    Jane
12  Jane    Jane      1     Bob
13  Jane    Bill      1    Jane
14  Jane    Jane      1    Jill
15  Jane     Bob      2    Jane
16  Jane    Jane      2     Bob
17  Jane    Jane      1    Bill
18  Jane    Jane      3    Jill
19  Jane    Jill      3    Jane
20  Jane    Jane      3   Susan
21  Jane   Susan      3    Jane
22  Jane   Susan      3    Jill

The problem is that this seems inefficient for large edge lists. In my real data, "lookup" has 3,263 observations while "edges" has 167,775,170 observations. I've been running the script above on an Amazon EC2 instance with 16 cores and 100GB of RAM for two days now with no end in sight (using "future_lapply" instead of "lapply" to allow for parallel processing). Is there any way to make this more efficient/faster?
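
For reference, the parallel variant I'm running looks roughly like this (a sketch rather than my exact script; it assumes the future.apply package and one worker per core):

library(igraph)
library(future.apply) # provides future_lapply(); also attaches future for plan()
plan(multisession, workers = 16)

g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
res <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    future_lapply(
      as.character(lookup$Name),
      function(nm) {
        # nm plus its direct neighbours, then every edge inside that set
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)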

This won't be the only time I need to group edges like this and I'm hoping to find a way to do it that isn't so expensive in terms of time and Amazon bills.

Obed
  • The logic is, you have e.g. exactly two v1=Bob, ID=1, v2=Jane, and one should be labeled Bob, the other Jane? – jay.sf Mar 01 '21 at 15:43
  • Imagine that the vertices are co-authors listed on a publication. The ID of the publication is the edgeID. So let's say that Publication1 has three authors: obed, jay.sf, and susan. We would have 3 rows in the edges table: obed--jay.sf, obed--susan, jay.sf--susan. Now, let's say that the "lookup" table has two entries: jay.sf, obed. We want to find every edge for every publication where jay.sf is an author, and every edge for every publication where obed is an author. In this example, our result will therefore be 6 rows (3 for obed, 3 for jay.sf). – Obed Mar 01 '21 at 17:14

1 Answer

I think you can shrink your original data.frame edges first, so that you avoid calling unique inside lapply on every iteration.

The code below should speed things up a bit, though I'm not sure how much it gains on your real data.

# deduplicate once, up front, so unique() is no longer needed inside lapply()
edges.unique <- unique(edges[c(1, 3, 2)])
g <- graph_from_data_frame(edges.unique, directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        # vertices at distance < 2: nm itself plus its direct neighbours
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
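
As a rough check on the toy data above, the up-front deduplication shrinks the edge list from 21 rows to 14 (your real data will of course shrink by a different amount):

nrow(edges)        # 21
nrow(edges.unique) # 14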

Update

# put each pair in a canonical order (both ifelse() calls are evaluated
# against the original columns inside transform(), so the swap is safe),
# then drop duplicates so A--B and B--A collapse to one row per edgeID
edges.unique <- unique(
  transform(
    edges[c("vertex1", "vertex2", "edgeID")],
    vertex1 = ifelse(vertex1 < vertex2, vertex1, vertex2),
    vertex2 = ifelse(vertex1 < vertex2, vertex2, vertex1)
  )
)
g <- graph_from_data_frame(edges.unique, directed = FALSE)
res <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        # vertices at distance < 2: nm itself plus its direct neighbours
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)

gives

> res
   group vertex1 vertex2 edgeID
1    Bob    Bill     Bob      1
2    Bob     Bob    Jane      1
3    Bob    Bill    Jane      1
4    Bob     Bob    Jane      2
5   Jane    Bill     Bob      1
6   Jane     Bob    Jane      1
7   Jane    Bill    Jane      1
8   Jane    Jane    Jill      1
9   Jane     Bob    Jane      2
10  Jane    Jane    Jill      3
11  Jane    Jane   Susan      3
12  Jane    Jill   Susan      3

When you run plot(g), you will see the simplified graph (image omitted).
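
If the deduplication step itself is slow at 167 million rows, a data.table version of the same canonicalise-then-unique idea may be faster than transform() with ifelse() (a sketch, assuming the data.table package and character, not factor, vertex columns):

library(data.table)

dt <- as.data.table(edges)
# order each pair lexicographically so that A--B and B--A collapse together
dt[, c("v1", "v2") := .(pmin(vertex1, vertex2), pmax(vertex1, vertex2))]
edges.unique <- unique(dt[, .(vertex1 = v1, vertex2 = v2, edgeID)])

From there, the graph construction and the lapply loop work exactly as above.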

ThomasIsCoding
  • Thanks (and thank you for your original solution). This removes a few thousand rows. Inverse edges are another problem (e.g., obed--thomasiscoding vs. thomasiscoding--obed). I'm trying to remove those with the following, but it's taking a while to run: edges <- edges %>% group_by(grp = paste(edgeID, pmax(vertex1, vertex2), pmin(vertex1, vertex2), sep = "_")) %>% slice(1) %>% ungroup() %>% select(-grp) – Obed Mar 01 '21 at 17:18
  • @Obed I updated my solution, where the bidirectional duplicates are removed from the original `edges` to ensure each vertex pair is unique. You can check the output. – ThomasIsCoding Mar 01 '21 at 23:08
  • This is fantastic, thank you! It reduces my edge list from 167,775,170 to 83,208,852 rows. It'll still take a while to run, no doubt, but this reduces the workload a lot. – Obed Mar 02 '21 at 14:20