1

I have a network of more than 4,000 nodes and I have a list of edges (connections between pairs of nodes). All nodes should converge to a single central point, but I have no way of ordering the nodes, since they are not numbered or labeled in a way that it is feasible to reorder them.

What I need?: Based on the small example attached, I need all the nodes point to node F (F is reachable from all nodes), so that the undirected graph becomes a directed graph (DAG) and that as a restriction there is only a single edge between each node pair. I am allowed to remove edges if and only if it is to remove loops (eg A -> B, B <- A). I can't add edges either, as this is a real network and I can't create connections where they don't exist.

What I have is this:

 library(igraph)
 library(tidygraph)
 library(ggraph)
 library(tidyverse)

 # edge list
 edgelist <- tribble(
  ~from, ~to,
  "A", "B",
  "A", "C",
  "B", "D",
  "C", "D",
  "C", "E",
  "D", "E",
  "D", "F")
 
 # create the graph
 g <- as_tbl_graph(edgelist)
 
 # undirected graph 
 g %>% 
  ggraph(layout = "graphopt") +
  geom_edge_link() +
  geom_node_point(shape = 21, size = 18, fill = 'white') +
  geom_node_text(aes(label = name), size = 3) +
  theme_graph() 

This is the procedure I came up with to do the sorting so that the list of edges would become a DAG:

 s <- names(V(g))
 
 # define node objective
 target <- "F"
 
 # exclude target from vertex list
 vertex_list <- s[s != target]
 
 # calculate the simple path of each node to the destination node (target)
 route_list <- map(vertex_list, ~ all_simple_paths(graph = g, 
                                                   from = .x,
                                                   to = target)) %>% 
  set_names(vertex_list) %>% 
  map(~ map(., ~ names(.x))) %>%
  flatten() %>% 
  map(~ str_c(.x, collapse = ","))
 
 
 # generate the list of ordered edges
 ordered_edges <- do.call(rbind, route_list) %>% 
  as.data.frame(row.names = F) %>%  
  set_names("chain") %>% 
  group_by(chain) %>% 
  summarise(destination = str_split(chain, ","), .groups = "drop") %>% 
  mutate(
   
   from = map(destination, ~ lag(.x)) %>% 
    map(~ .x[!is.na(.x)]),
   
   to = map(destination, ~ lead(.x)) %>% 
    map(~ .x[!is.na(.x)]),
  ) %>% 
  
  select(from, to) %>% 
  unnest(cols = everything()) %>% 
  group_by(across(everything())) %>% 
  summarise(enlaces = n(), .groups = "drop") %>% 
  select(-enlaces)

Warning: When the number of nodes is of a certain size (let's say 90), this algorithm generates loops that make the graph non-acyclic, so an additional procedure I do is apply a function in Python called feedback_arc_set to remove the edges that will make the graph a DAG.

For simplicity, I am not including the necessary code to remove these loops, since in this specific example no loops are generated.

 # draw the graph again
 as_tbl_graph(ordered_edges) %>% 
  ggraph(layout = "graphopt") +
  geom_edge_link(arrow = arrow(length = unit(3, 'mm'),
                               type = "closed", 
                               angle = 30),
                 end_cap = circle(7, 'mm')) +
  geom_node_point(shape = 21, size = 18, fill = 'white') +
  geom_node_text(aes(label = name), size = 3) +
  theme_graph() 

Created on 2021-07-07 by the reprex package (v2.0.0)

So what is the problem?: The complexity of the algorithm when the number of nodes is greater than 2000

If I try to do this with say 2000 nodes the algorithm never ends. I left it running for 24 hours and it didn't finish. In fact, I didn't find a way to know if it was working. In this place I found that the function of {igraph} all_simple_paths uses DFS behind the scenes, but complexity is O (|V|!) where |V| is the number of vertices and |V|! is the factorial of the number of vertices.

Is there a way to do this with less complexity?

William
  • 164
  • 11

2 Answers2

1

There is no way of avoiding doing a DFS. However the problem is not due to the complexity of the DFS algorithm. I can do a DFS on a graph of 403,394 nodes and 3,387,388 links in less than a second https://github.com/JamesBremner/PathFinder2/wiki/Performance

The likely problem is that your algorithm calls for a huge number of DFS's to be performed.

I would suggest the following algorithm, which should run in a second or so with a graph of moderate size such as 4,000 nodes.

The first thing you need to do is check that every node is reachable from F. A single DFS starting from F will tell you this. If every node is not reachable, the problem is not solvable without adding edges.

Now, traverse the paths to determine which direction each link should have. Note that any edge that is not traversed is unnecessary and can be removed - thus preventing the "accidental" introduction of loops

Note that if you have a decent DFS implementation that allows you to specify a visitor, you can do this in one step, marking the direction of edges as the DFS goes along. All that remains is to remove unnecessary edges that have not been visited. The whole thing will then run in a flash on a 4,000 node graph.

===

Any interest in an application to give a fast solution to this problem? Running on a MSWindows machine, written in C++17, based on the PathFinder class, guaranteed performance > 1,000 nodes/second?

ravenspoint
  • 19,093
  • 6
  • 57
  • 103
  • I was reviewing your application, but sorry for my ignorance, I could not find how to install the application. I understand that it runs on Windows with a GUI – William Jul 08 '21 at 21:35
  • The application is a demo of several different problems that can be solved using the PathFinder graph class. You have to download and build the source. I was wondering if you were interested in my developing a similar application, but specialized to solve your particular problem. – ravenspoint Jul 08 '21 at 23:36
  • You mean in a business sense to pay for the software or for your services? – William Jul 09 '21 at 00:17
  • Yes. Would you be interested? – ravenspoint Jul 09 '21 at 02:10
  • Well, my company hired an inexperienced programmer like me to avoid paying for the services of a specialist like you, so even if I wanted to, I have budget restrictions and this is why I am here on stackoverflow (lol). Thanks for your help. – William Jul 12 '21 at 17:18
  • A false economy! What would your budget be? Is your 4,000 node network available somewhere so I could run my algorithm on it? – ravenspoint Jul 12 '21 at 17:28
  • I would totally agree with you if it were not for the fact that I have benefited from that decision. No my friend, unfortunately the data is not available. Thanks again. – William Jul 12 '21 at 18:27
  • So what do you want to do? I have described the algorithm in detail. 5 days later you still have not implemented it! Are you going to continue wasting time and money? I suggest you accept my implementation and move on to some new problem. – ravenspoint Jul 12 '21 at 19:09
  • As you see, I am testing the ThomasIsCoding implementation, and although your answer is good, it has no code that I can test. I just hope to finish a few tests before placing Thomas's answer as the best. – William Jul 12 '21 at 20:07
1

Quick Answer

Actually you can split the vertices into groups in terms of distances to "F", and then check the neighborhood between nodes from two adjacent groups to add the edges.


Idea Behind

With respect to the distances to "F", the idea comes from the facts that:

  1. If a node is with distance d, then its parent(s) must be with distance d+1.
  2. If X is with distance d+1, then the nodes with distance d must be the children of X if and only if they are neighbors of X.

My Attempt

D <- distances(g)
d <- distances(g, "F")
lst <- split(colnames(d), d)
lst <- lst[order(as.integer(names(lst)))]
res <- c()
for (k in head(seq_along(lst), -1)) {
    pre <- lst[[k]]
    nxt <- lst[[k + 1]]
    for (p in pre) {
        dp <- D[p, nxt, drop = FALSE]
        if (any(dp == 1)) {
            res[[length(res) + 1]] <- data.frame(
                from = colnames(dp)[dp == 1],
                to = p
            )
        }
    }
}
dag <- graph_from_data_frame(do.call(rbind, res))

then

plot(dag)

gives

enter image description here

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • Hi @ThomasIsCoding. This works really well with an initial test of 2000 nodes under some specific scenarios. I will try to validate it against a larger number of nodes and include situations that could arise, but it looks promising. Thanks!! – William Jul 09 '21 at 00:12
  • @William You are welcome. Leave me a message if you have any problem. – ThomasIsCoding Jul 09 '21 at 20:32