Finding indirect nodes for every edge (in R)

Question

I have information on groups of physicians working together in given hospitals. A physician can work in more than one hospital at the same time. I would like to write code that outputs all indirect colleagues of a given physician working in a given hospital. For instance, if I work in a given hospital with another physician who also works in another hospital, I would like to know which physicians my colleague works with in that other hospital.

Consider a simple example of three hospitals (1, 2, 3) and five physicians (A, B, C, D, E). Physicians A, B and C work together in hospital 1. Physicians A, B and D work together in hospital 2. Physicians B and E work together in hospital 3.

For each physician working in a given hospital, I would like information on their indirect colleagues through each of their direct colleagues. For example, physician A has one indirect colleague through physician B in hospital 1: physician E in hospital 3. On the other hand, physician B does not have any indirect colleague through physician A in hospital 1. Physician C has two indirect colleagues through physician B in hospital 1: physician D in hospital 2 and physician E in hospital 3. And so on.

Below is the object that describes the networks of physicians in all hospitals:

library(dplyr)  # provides tibble(), %>% and arrange()

edges <- tibble(hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
                from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
                to   = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B")) %>% arrange(hosp, from, to)

I would like code that produces the following output:

output <- tibble(hosp     = c("1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3"), 
             from     = c("A", "A", "B", "B", "C", "C", "C", "A", "A", "B", "B", "D", "D", "D", "B", "E", "E", "E", "E"), 
             to       = c("C", "B", "C", "A", "B", "A", "B", "D", "B", "A", "D", "A", "B", "B", "E", "B", "B", "B", "B"),
             hosp_ind = c("" , "3", "" , "" , "2", "2", "3", "" , "3", "" , "" , "1", "1", "3", "" , "1", "1", "2", "2"),
             to_ind   = c("" , "E", "" , "" , "D", "D", "E", "" , "E", "" , "" , "C", "C", "E", "" , "A", "C", "A", "D")) %>% arrange(hosp, from, to)

score 3 · Accepted Answer · answered Apr 08 '21 at 21:32

Here is one option using igraph + data.table

library(igraph)
library(data.table)

g <- simplify(graph_from_data_frame(edges, directed = FALSE))
res <- setDT(edges)[
  ,
  c(.SD, {
    to_ind <- setdiff(
      do.call(
        setdiff,
        Map(names, ego(g, 2, c(to, from), mindist = 2))
      ), from
    )
    if (!length(to_ind)) {
      hosp_ind <- to_ind <- NA_character_
    } else {
      hosp_ind <- lapply(to_ind, function(v) names(neighbors(g, v)))
    }
    data.table(
      hosp_ind = unlist(hosp_ind),
      to_ind = rep(to_ind, lengths(hosp_ind))
    )
  }),
  .(id = seq(nrow(edges)))
][, id := NULL][]

and you will obtain

> res
    hosp from to hosp_ind to_ind
 1:    1    A  B        3      E
 2:    1    A  C     <NA>   <NA>
 3:    1    B  A     <NA>   <NA>
 4:    1    B  C     <NA>   <NA>
 5:    1    C  A        2      D
 6:    1    C  B        2      D
 7:    1    C  B        3      E
 8:    2    A  B        3      E
 9:    2    A  D     <NA>   <NA>
10:    2    B  A     <NA>   <NA>
11:    2    B  D     <NA>   <NA>
12:    2    D  A        1      C
13:    2    D  B        1      C
14:    2    D  B        3      E
15:    3    B  E     <NA>   <NA>
16:    3    E  B        1      A
17:    3    E  B        2      A
18:    3    E  B        1      C
19:    3    E  B        2      D

Also, you can run plot(g) to see the graph.

ThomasIsCoding

Thanks so much @ThomasIsCoding, there is a small misunderstanding. There should be 5 nodes (A,B,C,D,E) and not 8 nodes. 1, 2 and 3 refer to the different hospitals where the links are formed. I tried correcting it when defining the object g `g <- graph_from_data_frame(select(edges,to,from), vertices = c("A","B","C","D","E"), directed = FALSE)` but then the object `res` comes out wrong. – PaulaSpinola Apr 09 '21 at 13:20
@PaulaSpinola You need to put 5 nodes + 3 hospital nodes together in the same graph, since hospitals are important vertices containing the association relations among nodes. – ThomasIsCoding Apr 09 '21 at 13:24
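To make the point about the 8 vertices concrete, here is a small sketch that rebuilds `g` from the question's data and inspects it. It assumes the same column order as in the question (`hosp`, `from`, `to`), so `graph_from_data_frame()` links each hospital to a physician and carries `to` along only as an edge attribute.

```r
library(igraph)

edges <- data.frame(
  hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
  from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
  to   = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B"),
  stringsAsFactors = FALSE
)

# The first two columns (hosp, from) define the edges, so the graph links
# hospitals to physicians; simplify() drops the repeated hospital-physician pairs.
g <- simplify(graph_from_data_frame(edges, directed = FALSE))

sort(V(g)$name)                  # 3 hospital vertices plus 5 physician vertices
sort(names(neighbors(g, "B")))   # the hospitals where B works
sort(names(neighbors(g, "1")))   # the physicians working in hospital 1
```

In this layout two physicians are direct colleagues exactly when they are two steps apart, which is why the answer calls `ego()` with `order = 2` and `mindist = 2`.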
Many thanks @ThomasIsCoding. I am studying your code very carefully :) – PaulaSpinola Apr 09 '21 at 16:59
@ThomasIsCoding, would you mind shedding some light on (i) how you can assign new objects and create a data.table inside c() & (ii) how you can make "by" refer to a nonexistent variable (i.e., id)? I usually think of data.table as aggregating objects, but in this case we are expanding the original object (i.e., edges). – PaulaSpinola Apr 09 '21 at 17:05
@PaulaSpinola (i) Since `.SD` and the result in `{}` are both `data.table`s, we can use `c(...)` to concatenate them. (ii) `id` is an auxiliary variable that helps slice the `data.table` by rows and run the function row-wise. That's why I remove `id` at the last step. – ThomasIsCoding Apr 09 '21 at 19:40
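The row-wise idiom described above can be seen in isolation on a toy table. This is only a sketch: `dt`, `x`, `n` and `k` are made-up names, not part of the question's data. Each row becomes its own group through the auxiliary `id`, and `c(.SD, ...)` glues the original columns to a block of new rows, with the length-1 columns recycled.

```r
library(data.table)

# Hypothetical toy data: expand each row into `n` rows.
dt <- data.table(x = c("a", "b"), n = c(2L, 3L))

res <- dt[
  ,
  c(.SD, {
    k <- seq_len(n)        # ordinary assignment, local to this group's j call
    data.table(k = k)      # new column(s), possibly several rows per input row
  }),
  by = .(id = seq_len(nrow(dt)))  # one group per row; `id` exists only here
][, id := NULL][]

res
```

Inside `{}` the `<-` assignments create temporary objects rather than columns, so `:=` is not needed; only what the braces return ends up in the result.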
@ThomasIsCoding (i) Shouldn't new variables in data.table be added with ":=" instead of "<-"? (ii) Would you recommend any reference on using auxiliary variables for the by argument that I can look at? (iii) I am still trying to get my head around it, as I will have to adapt it to my data, which is more complex (I have a temporal dimension). Do you think it may be easier to write the code with dplyr? – PaulaSpinola Apr 09 '21 at 19:59
@PaulaSpinola (i) The reason I didn't use `:=` is that we create a data.table within `{...}`, where some steps are intertwined. In this case, it would be easier for me to use `c(...)` for concatenation. Also, `:=` will show the newly created variables only, not including the other column information from the original data.table. (ii) I don't have any reference for that, but I believe you will learn it by trying and reading others' code. (iii) I guess `dplyr` would be a good option if you have many steps, but I cannot help you on that since I have very little experience with `dplyr`. – ThomasIsCoding Apr 09 '21 at 20:14
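For readers who prefer `dplyr`, one possible join-based sketch follows. It is not the answerer's method, and it rests on two assumptions: an indirect colleague is read as someone who shares a different hospital with `to`, is not `from`, and never shares any hospital with `from`; and `dplyr` is at least version 1.1.0, which the `relationship` argument requires. The helper names `via`, `direct` and `indirect` are made up for illustration.

```r
library(dplyr)

edges <- tibble(
  hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
  from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
  to   = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B")
)

# The same edges seen from the direct colleague's side:
# physician `to` works with `to_ind` in hospital `hosp_ind`.
via <- transmute(edges, hosp_ind = hosp, to_ind = to, to = from)

# Every direct colleague pair, regardless of hospital.
direct <- edges %>% distinct(from, to) %>% rename(to_ind = to)

indirect <- edges %>%
  inner_join(via, by = "to", relationship = "many-to-many") %>%
  filter(hosp_ind != hosp, to_ind != from) %>%   # other hospital, not myself
  anti_join(direct, by = c("from", "to_ind"))    # drop my own direct colleagues

# Keep edges without indirect colleagues as rows with NA.
res <- edges %>%
  left_join(indirect, by = c("hosp", "from", "to")) %>%
  arrange(hosp, from, to, hosp_ind, to_ind)
```

On the example data this gives 19 rows, 7 of them with `NA` in `hosp_ind` and `to_ind`. Adapting it to data with a temporal dimension would presumably mean adding the time variable to the join keys, which has not been tried here.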
@ThomasIsCoding thanks! Would you know why `ego(g, 2, c(to, from), mindist = 2)` doesn't work on its own? I get the following error message: **Error in as.igraph.vs(graph, nodes) : object 'to' not found**. I think it has to do with R recognizing as vertices the columns `hosp` & `from` instead of `from` & `to` in object `edges` – PaulaSpinola Apr 09 '21 at 21:27
@PaulaSpinola `from` and `to` are used within the environment of `setDT(edges)`, something like `with(edges,...)`. Otherwise, you should use `edges$from` and `edges$to` – ThomasIsCoding Apr 09 '21 at 22:06
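As a sketch of that scoping point, the `ego()` call can be run for a single edge outside `data.table` by bringing the columns into scope with `with()` or `$`. The example rebuilds the question's data in its original, unsorted order, where row 2 is the edge from A to B in hospital 1.

```r
library(igraph)

edges <- data.frame(
  hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
  from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
  to   = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B"),
  stringsAsFactors = FALSE
)
g <- simplify(graph_from_data_frame(edges, directed = FALSE))

# `to` and `from` are columns, not global objects, so they need a scope.
nb  <- with(edges[2, ], ego(g, 2, c(to, from), mindist = 2))

# The same call written with `$`.
nb2 <- ego(g, 2, c(edges$to[2], edges$from[2]), mindist = 2)

# Vertices two steps from B but not two steps from A, excluding A itself.
to_ind <- setdiff(do.call(setdiff, Map(names, nb)), edges$from[2])
to_ind
```

For this edge the result is physician E, matching the first row of `res` in the answer.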
