14

With the parent-child relationships data frame as below:

  parent_id child_id
1         1        2
2         2        3
3         3        4

The goal is to achieve the following, i.e. expanded version of previous data frame where all the descendants (children, grandchildren, etc.) are assigned to each parent (and incl. the parent/child itself):

   parent_id child_id
1          1        1
2          1        2
3          1        3
4          1        4
5          2        2
6          2        3
7          2        4
8          3        3
9          3        4
10         4        4

The question I have: What is the fastest possible way (or one of them) of achieving that in R?

I've already tried various methods - from a for loop, SQL recursion to using igraph (as described here). They are all rather slow, and some of them are also prone to crashing when dealing with a larger number of combinations.

Below are examples with sqldf and igraph, benchmarked on a slightly larger data frame than above.

library(sqldf)
library(purrr)
library(dplyr)
library(igraph)

df <- data.frame(parent_id = 1:1000L)
df$child_id <- df$parent_id + 1L

# SQL recursion

sqlQuery <- 'with recursive
             dfDescendants (parent_id, child_id)
             as
             (select parent_id, child_id from df
             union all
             select d.parent_id, s.child_id from dfDescendants d
             join df s
             on d.child_id = s.parent_id)
             select distinct parent_id, parent_id as child_id from dfDescendants
             union
             select distinct child_id as parent_id, child_id from dfDescendants
             union
             select * from dfDescendants;'

sqldf(sqlQuery)

# igraph with purrr

df_g = graph_from_data_frame(df, directed = TRUE)

map(V(df_g), ~ names(subcomponent(df_g, .x, mode = "out"))) %>% 
  map_df(~ data.frame(child_id = .x), .id = "parent_id")

Benchmark (excl. query creation in sqldf and conversion to graph in igraph):

set.seed(23423)

microbenchmark::microbenchmark(
  sqldf = sqldf(sqlQuery),
  tidyigraph = map(V(df_g), ~ names(subcomponent(df_g, .x, mode = "out"))) %>% 
    map_df(~ data.frame(child_id = .x), .id = "parent_id"),
  times = 5
)

#    Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#          sqldf 7.815179 8.002836 8.113392 8.084038 8.315207 8.349701     5
#     tidyigraph 5.784239 5.806539 5.883241 5.889171 5.964906 5.971350     5
arg0naut91
  • 14,574
  • 2
  • 17
  • 38
  • It sounds like you effectively want a transitive closure. You may open a feature request for igraph. – Szabolcs Feb 07 '22 at 17:36
  • 1
    I am not an R user and I don't have the time to understand what you are doing here exactly, but here are some tips: (1) With igraph, building the graph (e.g. from the dataframe) can be costly. Avoid conversions. (2) `subcomponent()` should be fast once the graph is built, but currently any graph traversal has a setup cost in graph, thus repeated calls of `subcomponent` (for each vertex) are not ideal (3) You could do it in one go with the `ego` function, setting the `order` to the number of vertices. But I just looked into the C code, and it doesn't actually avoid the setup cost. – Szabolcs Feb 07 '22 at 17:49
  • That's because it is optimized for small orders and a small set of vertices. I can look into adding an optimized code path for the case of handling _all_ vertices. If you can do a bit of benchmarking first, that will be useful. – Szabolcs Feb 07 '22 at 17:51
  • 1
    `as_edgelist(connect(df_g, order=length(V(df_g)), mode="in"))` seems faster (but with these setting doesn't get the self loops) – user20650 Feb 07 '22 at 18:06
  • @Szabolcs, you mean something like `ego(df_g, order = length(V(df_g)), mode = "out")`, right? Quick check indicates it's 20x faster on the benchmark `df` from above (compared to `subcomponent`), which is great, but I assume this would need further checks at larger scales - I'll try to find time for that tomorrow. Note that the conversion to data frame is not included in the benchmark above; I'll post complete code for more transparency. – arg0naut91 Feb 07 '22 at 19:24
  • @user20650, seems to work but need to benchmark still - any way of including the self loops? – arg0naut91 Feb 07 '22 at 19:25
  • 1
    Yes. You can use `order=vcount(df_g)`. After having looked at the C source more carefully, I am actually quite surprised that it's that much faster. I opened an issue for trying to speed it up though: https://github.com/igraph/igraph/issues/1954 – Szabolcs Feb 07 '22 at 19:34
  • Around 220 milliseconds compared to 4500 for `subcomponent` (used `vcount` now). Seems quite fast, but as I said, will also follow up with more tests. Will be following the issue as well, thanks! It seems like a great solution overall - feel free to post a general igraph-based explanation and I accept it in case there'll be no other, faster solution. – arg0naut91 Feb 07 '22 at 19:44
  • @arg0naut91; the self-loops are trivial; if needed you can use `cbind(V(df_g), V(df_g))` then just `rbind` this to the edgelist – user20650 Feb 08 '22 at 11:44
  • @user20650, thanks - indeed a union would solve it then, same as for SQL. I thought there's another `igraph` configuration with your option that could bring it, though (and actually at first thought it's only the isolated child self-loops that are not taken care of). – arg0naut91 Feb 08 '22 at 12:04

2 Answers2

8

We can use ego like below

library(igraph)
df <- data.frame(parent_id = 1:3, child_id = 2:4)
g <- graph_from_data_frame(df)

setNames(
  rev(
    stack(
      Map(
        names,
        setNames(
          ego(g,
            order = vcount(g),
            mode = "out"
          ),
          names(V(g))
        )
      )
    )
  ),
  names(df)
)

which gives

   parent_id child_id
1          1        1
2          1        2
3          1        3
4          1        4
5          2        2
6          2        3
7          2        4
8          3        3
9          3        4
10         4        4

Benchmarking

set.seed(23423)

microbenchmark::microbenchmark(
  sqldf = sqldf(sqlQuery),
  tidyigraph = map(V(df_g), ~ names(subcomponent(df_g, .x, mode = "out"))) %>%
    map_df(~ data.frame(child_id = .x), .id = "parent_id"),
  ego = setNames(
    rev(
      stack(
        Map(
          names,
          setNames(
            ego(df_g,
              order = vcount(df_g),
              mode = "out"
            ),
            names(V(df_g))
          )
        )
      )
    ),
    names(df)
  ),
  times = 5
)

shows

Unit: milliseconds
       expr       min       lq      mean    median         uq        max neval
      sqldf 7156.2753 9072.155 9402.6904 9518.2796 10206.3683 11060.3738     5
 tidyigraph 2483.9943 2623.558 3136.7490 2689.8388  2879.5688  5006.7853     5
        ego  182.5941  219.151  307.2481  253.2171   325.8721   555.4064     5

The code using pipes for readability :

g |>
  ego(order = vcount(g), mode = "out") |>
  setNames(names(V(g))) |>
  Map(f = names) |>
  stack() |>
  rev() |>
  setNames(names(df))
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • The answer is accepted because it is relatively fast and based on a well-known library (and therefore likely to be robust). However, to be noted that `ancestor_descendant` function by @MartinMorgan is even faster and worth exploring (with necessary tests based on the actual use case, of course). – arg0naut91 Feb 11 '22 at 14:30
  • 2
    @arg0naut91 Thanks! Perhaps you can set a bounty for this question to attract more attention of those algorithmic or graph theory experts to provide better solutions than `ego`. – ThomasIsCoding Feb 11 '22 at 14:39
  • I'll consider that, thanks. – arg0naut91 Feb 11 '22 at 15:24
  • 1
    @arg0naut91 I set up a bounty, and hope more people could come to answer your question. – ThomasIsCoding Feb 23 '22 at 10:57
7

igraph is of course a good way to answer graph questions (see also Bioconductor's graph + RBGL packages), but I think this has an iterative solution in R.

It seems like a reasonable approach is to perform a depth-first graph traversal (I was expecting a fancier solution). This is actually quite easy to implement efficiently in R. Suppose vectors pid and cid describe the links between parent and child nodes in the graph (as in the data.frame in the question). Represent each node as a positive integer.

all_nodes <- unique(c(parent_id, child_id)  # all nodes
uid <- match(all_nodes, all_nodes)
pid <- match(parent_id, all_nodes)
cid <- match(child_id, all_nodes)

and form an edge list from each node to all its children.

edge_list <- unname(split(cid, factor(pid, levels = uid)))
edge_lengths <- lengths(edge_list)

The child nodes of the current child nodes is edge_list[cid], and the number of level 2 child nodes associated with each original parent node is rep(pid, edge_lengths[cid]). So the path from any node to any other reachable node is traversed by the simple iteration

while (length(pid)) {
    pid <- rep(pid, edge_lengths[cid])
    cid <- unlist(edge_list[cid])
}

@jblood94 points out that the traverse has to keep track of edges already visited. We can implement this efficiently (in time, not space!) by creating a logical vector of visited edges. We use a 'factory' pattern, where we create a function that retains state (the logical vector of nodes visited). The vector is indexed by the unique id (key) of the edge pid * n + cid. We're interested in keys that are not duplicated, and have not been already visited.

visitor <- function(uid, n_max = 3000) {
    n <- length(uid)
    if (n <= n_max) {
        ## over-allocate, to support `key = pid * n + cid`
        visited <- logical((n + 1L) * n) # FALSE on construction
    } else {
        stop("length(uid) greater than n_max = ", n_max)
    }
    function(pid, cid) {
        key <- pid * n + cid
        to_visit <- !(duplicated(key) | visited[key])
        visited[key[to_visit]] <<- TRUE  # update nodes that we will now visit
        to_visit
    }
}

Thus

> visit = visitor(1:10)
> visit(1:3, 2:4)
[1] TRUE TRUE TRUE
> visit(2:4, 3:5)
[1] FALSE FALSE  TRUE

Here is a more complete implementation of the entire solution, with additional book-keeping

visitor <- function(uid, n_max = 3000) {
    n <- length(uid)
    if (n <= n_max) {
        ## over-allocate, to support `key = pid * n + cid`
        visited <- logical((n + 1L) * n) # FALSE on construction
    } else {
        stop("length(uid) greater than n_max = ", n_max)
    }
    function(pid, cid) {
        key <- pid * n + cid
        to_visit <- !(duplicated(key) | visited[key])
        visited[key[to_visit]] <<- TRUE
        to_visit
    }
}

ancestor_descendant <- function(df) {
    ## encode parent and child to unique integer values
    ids <- unique(c(df$parent_id, df$child_id))
    uid <- match(ids, ids)
    pid <- match(df$parent_id, ids)
    cid <- match(df$child_id, ids)
    n <- length(uid)

    ## edge list of parent-offspring relationships, based on unique
    ## integer values; list is ordered by id, all ids are present, ids
    ## without children have zero-length elements. Use `unname()` so
    ## that edge_list is always indexed by integer
    edge_list <- unname(split(cid, factor(pid, levels = uid), drop = FALSE))
    edge_lengths <- lengths(edge_list)

    visit <- visitor(uid)
    keep <- visit(uid, uid) # all TRUE
    aid = did = list(uid) # results -- all uid's are there own ancestor / descendant
    i = 1L
   
    while (length(pid)) {
        ## only add new edges
        keep <- visit(pid, cid)
        ## record current generation ancestors and descendants
        pid <- pid[keep]
        cid <- cid[keep]
        i <- i + 1L
        aid[[i]] <- pid
        did[[i]] <- cid

        ## calculate next generation pid and cid.
        pid <- rep(pid, edge_lengths[cid])
        cid <- unlist(edge_list[cid])
    }
    ## decode results to original ids and clean up return value
    df <- data.frame(
        ancestor_id = ids[unlist(aid)],
        descendant_id = ids[unlist(did)]
    )
    df <- df[order(df$ancestor_id, df$descendant_id),]
    rownames(df) <- NULL
    df
}

This seems to be correct and performant, at least superficially

## Original example
df <- data.frame(parent_id = 1:1000L)
df$child_id <- df$parent_id + 1L
df = df[sample(nrow(df)),]
system.time(result <- ancestor_descendant(df))
##  user  system elapsed 
## 0.243   0.001   0.245 
dim(result)
## [1] 501501      2

## updated example from comments
df <- data.frame(parent_id = 1:1000L)
df$child_id <- df$parent_id + 1L
df <- rbind(df, data.frame(parent_id = 1000L, child_id = 1002L))
system.time(result <- ancestor_descendant(df))
##  user  system elapsed 
## 0.195   0.001   0.195 
dim(result)
## [1] 502502      2

## problematic case from @jblood94
df <- data.frame(
    parent_id=c(1, 1, 2),
    child_id = c(2, 3, 3)
)
ancestor_descendant(df)
##   ancestor_id descendant_id
## 1           1             1
## 2           1             2
## 3           1             3
## 4           2             2
## 5           2             3
## 6           3             3

## previously failed without filtering re-visited nodes
df <- data.frame(
    parent_id = rep(1:100, each = 2),
    child_id = c(2, rep(3:101, each = 2), 102)
)
system.time(result <- ancestor_descendant(df))
##  user  system elapsed 
## 0.005   0.000   0.006 
dim(result)
## [1] 5252    2
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • This does seem to be fast, however it seems to fail with my proper use case (my actual example does seem to lack complexity to an extent). Consider adding `df <- rbind(df, data.frame(parent_id = 1000L, child_id = 1002L))` to the example; it will miss the full parent line of `1002`. – arg0naut91 Feb 08 '22 at 17:13
  • @arg0naut91 I had made an update earlier today, and added just now a final paragraph showing the result for your updated data frame. Is that what you were expecting, or...? – Martin Morgan Feb 08 '22 at 18:35
  • Unfortunately not - 1002 should be the descendant of 999 as well as of all of the previous nodes. – arg0naut91 Feb 08 '22 at 18:53
  • @arg0naut91 I've updated the function to be a little more correct, but also a little slower... – Martin Morgan Feb 08 '22 at 22:56
  • 1
    @arg0naut91 actually back to a cleaner implementation and faster performance. – Martin Morgan Feb 09 '22 at 09:41
  • This does seem to work - it's pretty fast as well, also compared to other (e.g. `igraph`) solutions. I'll leave the question open for a bit to see if there are other answers and to make some more tests, otherwise I accept it. – arg0naut91 Feb 09 '22 at 11:13
  • This blows up with other inputs. Try `df<- data.frame(parent_id = rep(1:30, each = 2), child_id = c(2, rep(3:31, each = 2), 32))`. Then try `dt <- data.frame(parent_id = rep(1:100, each = 2), child_id = c(2, rep(3:101, each = 2), 102))`. I doubt you'll beat the `igraph` solution by @ThomasIsCoding for all-around performance without going Rcpp and a expending a lot of effort. – jblood94 Feb 09 '22 at 16:51
  • @jblood94 yes, you're right, a robust solution would require a lot of effort, and would re-invent the wheel. Nonetheless I added a second version that addresses the problem you've identified, but at a significant performance hit (and a requirement for R-devel). – Martin Morgan Feb 09 '22 at 20:37
  • 1
    @jblood94 well I take at least some of that back. It seems like a reasonable solution is a depth-first traversal of the graph, starting from all nodes, provided that edges are not revisited. Edges can be indexed uniquely, and whether they are visited or not tracked efficiently (in time, not space) in a logical vector. For me my current R solution is slightly faster than the ego solution. – Martin Morgan Feb 10 '22 at 18:18
  • 1
    @MartinMorgan your method seems to be sound, fast and likely relevant for high % of actual use cases. It seems to be up to 10 times faster compared to `ego` on examples similar in size/complexity to `df` from my question. However, given that `ego` solution is based on a well-known open source library (and therefore exposed to more scrutiny i.e. more likely to be robust) I will accept the answer based on `igraph`. – arg0naut91 Feb 11 '22 at 14:24