Faster cycle detection in a Directed Acyclic Graph?

Question

I have a program in Ruby 1.9.3 that constructs a RubyTree. My data is best described as being a Directed Acyclic Graph (DAG); please note that it is not a polytree. Well, at least the data should be a DAG, despite users' best efforts to foil my program with bad data.

I build the DAG dynamically by parsing an XML document. The XML document does not explicitly specify the tree structure, but does provide cross-references of integer IDs that establish links between elements in the document.

I need to ensure that the RubyTree does not contain any cycles. The source data may (erroneously) have a cycle, and if it does, my program needs to be aware of it, rather than entering an infinite loop or crashing. To accomplish this currently, I mixed in the Ruby standard library's TSort module into RubyTree's Tree::TreeNode class. This uses Tarjan's algorithm to perform a topological sort on the graph each time a node is added. During the topological sort, if a cycle is detected, an exception is raised -- exactly what I want.

For example:

module Tree
  class TreeNode
    include TSort

    def tsort_each_node(&block)
      self.each(&block)
    end

    def tsort_each_child(node, &block)
      node.get_children().each { |child| yield child }
    end

    def add(child, at_index = -1)
      #The standard RubyTree implementation of add goes here
      begin
        self.tsort()
      rescue TSort::Cyclic => exce
        self.remove!(child)
        raise exce
      end
      return child
    end
  end
end

I had to modify a few other methods too. Basically anything that needs to traverse the tree or the children needed to have TSort implemented, OR get rid of its dependence on traversing (for instance, I simplified Tree::TreeNode#to_s() to return Tree::TreeNode#name.)

Right now, my program is functionally correct. I've done significant testing and the results are working fine: all I have to do is rescue TSort::Cyclic at the right points in my code, and if I ever try to add a node that causes a cycle, the node gets removed and I can log the problem in a report to deal with later (by fixing the source data).

Problem is, on a RubyTree of size 75000 or so, where the number of edges is very close to being equal to the number of vertices minus 1, iteratively running Tarjan's algorithm produces an algorithmic complexity that looks pretty quadratic. Tarjan's itself is O(|V| + |E|), which in my case is about O(2*|V|), but each time I call add(), |V| increases by 1, as I build up the graph node by node. I can't simply call Tarjan's at the end, because I might need to traverse the graph, or parts of it, during the read-compare-add loop, and any traversal attempt may hang the program or crash it if there is in fact a cycle. (It goes without saying that my code is single-threaded; if it weren't, we'd have a huge problem. As it stands, I rely upon the fact that add() never returns without raising an exception if there's a cycle, and even if there is a cycle, a node gets removed in such a way as to clear out the cycle before add() returns.)

But it's too slow! It's taking more than half an hour just for this one loop, and my program consists of several other steps that take their own fair share of time. But as it stands, just performing Tarjan's is eating up the lion's share of the performance, judging from the results of ruby-perf. I tried switching from arrays to linked lists in the RubyTree each implementation, but it only cut down the runtime by about 1% by removing loads of Array#concat calls.

I found this awesome paper by the Tarjan who invented the strongly connected components algorithm that Ruby's TSort relies upon, and it seems that incremental cycle detection is an active area of research. However, the level of dialogue in the paper is significantly over my head, and I'm pretty sure I lack the mathematical background to translate the paper's findings into Ruby code. Not only that, but by reading the Remarks section of the paper, it seems that their best effort algorithms have rather worrisome worst-case runtimes of their own, so it may not even be faster than my current method, depending on the specifics of my data.

Am I missing something silly here, or is my best bet to put in the mental legwork to analyze Tarjan's paper and try to come up with a Ruby implementation of one of the algorithms? Note that I don't particularly care about the topological sorting aspect of the algorithm; it's a side effect of what I really want. If the tree were not topologically sorted but still had a guarantee of no cycles, I would be perfectly happy.

Also worth noting is that cycles are somewhat rare in my source data. That is, cycles can happen due to manual error in a data entry process, but they would never happen intentionally, and should always be reported to the program so it can tell me, so I can beat someone over the head with a billyclub for entering the wrong data. Additionally, the program absolutely must keep on chugging even if it detects a particularly nasty cycle, so I can't just stick my head in the sand and hope there won't be any cycles.

What's the actual problem?

As requested by some, here's a demonstration you can run to see the issue at work.

Install the stable version of RubyTree (I use MRI 1.9.3). Then compare the output of these two programs:

Exhibit 1: Hangs forever with 100% CPU usage on the main thread after "Third time" is printed

require 'tree'

a = Tree::TreeNode.new('a', nil)
b = Tree::TreeNode.new('b', nil)
c = Tree::TreeNode.new('c', nil)
a.add(b)
a.add(c)
puts "First time"
b.add(c)
puts "Second time"
b.add(a)
puts "Third time"
c.add(b)
puts "Fourth time"
c.add(a)
puts "Fifth time"
puts "Done"

Exhibit 2: Goes all the way through and prints "Done", and the result has no cycle

Note that I would normally be doing stuff within the rescue blocks to log that cycle(s) happened, and complaining loudly at the human offenders who created these cycles.

require 'tree'
require 'tsort'

module Tree
  class TreeNode
    include TSort

    def tsort_each_node(&block)
      self.each(&block)
    end

    def tsort_each_child(node, &block)
      node.get_children().each { |child| yield child}
    end

    def to_s
      name
    end

    def get_children()
      return @children
    end

    def add(child, at_index = -1)
      unless child
        raise ArgumentError, "Attempting to add a nil node"  # Only handles the immediate child scenario
      end
      if self.equal?(child)
        raise TSort::Cyclic, "Cycle detected: [#{child.name}, #{child.name}]"
      end 

      # Lazy man's unique test, won't test if children of child are unique in this tree too.
      if @children_hash.include?(child.name)
        raise "Child #{child.name} already added!"
      end

      if insertion_range.include?(at_index)
        @children.insert(at_index, child)
      else
        raise "Attempting to insert a child at a non-existent location (#{at_index}) when only positions from #{insertion_range.min} to #{insertion_range.max} exist."
      end

      @children_hash[child.name] = child
      child.parent = self

      #CYCLE DETECTION - raises TSort::Cyclic if this caused a cycle
      begin
        self.tsort()
      rescue TSort::Cyclic => exce
        self.remove!(child)
        raise exce
      end
      return child
    end
  end
end

a = Tree::TreeNode.new('a', nil)
b = Tree::TreeNode.new('b', nil)
c = Tree::TreeNode.new('c', nil)
a.add(b)
a.add(c)
puts "First time"
b.add(c)
puts "Second time"
begin
  b.add(a)
rescue
end
puts "Third time"
begin
  c.add(b)
rescue
end
puts "Fourth time"
begin
  c.add(a)
rescue
end
puts "Fifth time"
puts "Done"

The goal, for me, is to develop code that is functionally equivalent to Exhibit 2, but scales better to larger numbers of vertices (I don't anticipate having more than 10^6 vertices, and in that case I would be fine with it taking several minutes ("go get a cup of coffee") on a modern desktop workstation, but not hours or longer.)

1) Do you have any control over the integer IDs (e.g. by specifying the format), or are they handed down arbitrarily? 2) Have you tried other cycle detection methods than just using Tarjan's SCC/toposort algorithm? You might want to try one of the tortoise-and-hare algorithms... — , Oct 07 '14 at 22:24
"any traversal attempt may hang the program or crash it if there is in fact a cycle" Could you expand on this? I think that the most promising way forward is to make this code more robust. — David Eisenstat, Oct 08 '14 at 15:09
@DavidEisenstat Thanks to your insightful comments and the deleted answer, I've added some stuff that you can "try at home" to the end of my question, which should at least help you reproduce the actual problem (Exhibit 1) and my hack-job of a solution that's too slow (Exhibit 2). — allquixotic, Oct 08 '14 at 15:49
I didn't run your example, but the problem appears to be on this line: https://github.com/evolve75/RubyTree/blob/master/lib/tree.rb#L358 The method [`root`](https://github.com/evolve75/RubyTree/blob/master/lib/tree.rb#L126) does not check for preexisting cycles. Perhaps you should find a proper graph library. — David Eisenstat, Oct 08 '14 at 16:07
Looks like what I really needed was [Plexus](https://github.com/chikamichi/plexus) -- a fork of a fork of a port :-) In fact the fork history of Plexus makes an interesting graph, itself: Plexus -> Graphy (several forks) -> GRATR (several forks) ->(spiritual port of) Boost Graphs Library (C++). People who maintain these libs just love to disappear / stop maintaining them, so there are tons of forks that get maintained temporarily when projects come up that need them :/ — allquixotic, Oct 08 '14 at 20:07
@allquixotic Could you turn that into an answer, so that people with the same kind of problem could just follow your footsteps? :) — , Oct 08 '14 at 20:18
On a tiny additional note, if you need to check indeed for identical objects, not just objects with identical names, then some performance improvement may result from using `child.__id__` (Fixnum) instead of `child.name` (String) as a key for `@children_hash`. Sorry, didn't try to actually test it, though. If somehow `hash` method is faster for Fixnums than for Strings, that may be of help. If not, comparing identical objects (Strings of Fixnums - doesn't matter) is fast. If I remember it correctly that's the first thing the Hash code checks after finding potentially identical keys. — moonfly, Oct 08 '14 at 23:48
Actually, just did a very simple experiment, comparing `hash = {}; hash[a] = true; t1 = Time.now; 10000.times { hash.include?(a) }; t2 = Time.now; t2-t1` to `hash = {}; hash[a.__id__] = true; t1 = Time.now; 10000.times { hash.include?(a.__id__) }; t2 = Time.now; t2-t1` for a = "a". No difference :(. If you compare performance of the 1st method for `a="a"` and `a=1` there is a 2x-1.5x improvement, but calling `__id__` eats all the difference. So, please disregard my previous comment. — moonfly, Oct 08 '14 at 23:54
I would have to agree with the sentiment. For a single use, it's probably less work to reinvent the wheel yourself than bust out the debugger and profiling fire graphs (although it is nice to provided actionable troubleshooting feedback when you can). I had a need for a trivial memoized Tarjan implementation to determine if adding an edge to a DAG would introduce a cycle, which wouldn't justify introducing the risk of adding other people's code. — , Jan 28 '15 at 19:59
@moonfly Checkout [absolute_time](https://github.com/bwbuchanan/absolute_time/) for Ruby on Linux/FreeBSD/OSX. It's a Benchmark replacement that's far more precise by using compensated `rdtsc`. — , Jan 28 '15 at 20:01

score 4 · Accepted Answer · answered Oct 09 '14 at 16:16

The Plexus gem for Ruby seems to have resolved the worst of my problems. I tried GRATR before, but it wouldn't load because it wasn't compatible with Ruby 1.9.3, but Plexus is a fork of a fork of GRATR that works with 1.9.3.

My problem was that I was using a data structure (RubyTree) that wasn't designed to handle cycles, but the Plexus Digraph can actually keep working with cycles. The API is designed with them in mind.

The solution I went with is pretty simple: basically, now that my graph data structure doesn't hang on cycles, I can just call Tarjan's algorithm at the end of the graph building routine -- actually, there's a nice wrapper acyclic? method, but it just calls topsort() under the hood, and implements the topological sort using Tarjan's strongly connected components algorithm, much like Ruby's stdlib's TSort. It does use its own implementation instead of TSort, though. I am unsure why.

Unfortunately, now I'm stuck with the challenge of developing an implementation of the minimum feedback arc set problem (minimum FAS problem) which is NP-hard. The minimum FAS problem is required because I need to remove the least intrusive number of arcs in the graph to make it acyclic.

My plan right now is to get the strongly connected components list from Plexus, which is an array of arrays; if any of the second-level arrays contain more than one element, then that array describes an element with a cycle, based on the definition of a strongly connected component. I then have to (using minimum FAS or an approximation) remove edges and/or vertices to make the graph acyclic, and iteratively run Tarjan's until every SCC subarray has length 1.

I'm thinking brute force is probably the best approach to solving minimum FAS: I don't need to be too clever because the number of nodes in any SCC in my data set will almost never exceed, say, 5 or 6. Exponential on 5 or 6 is fine. I severely doubt I'll have an SCC set of many hundreds of nodes with tens of different cycles in it; that would be an extremely pathological worst case that I do not think will ever happen. If it did, though, the runtime would be quite long.

Basically I need to try removing the power set of the arcs of the graph, one subset at a time with the set of subsets sorted ascending by subset size, and "guess and check" whether the graph is still cyclic (Tarjan's), then add the edges back if that power set doesn't fix the cycle.

If the number of edges and nodes is under 20 or so, which is almost guaranteed, it won't take significant runtime.

Removing the iterative Tarjan's definitely solved my complexity problem in the happy path (no cycles or just 1 trivial cycle), which is really where it was giving me the most heartburn -- instead of taking 25 minutes to build the graph, it takes 15 seconds.

Lesson learned: if your program is slow, it's probably because you're doing lots of unnecessary work. In my case, the unnecessary work was performing Tarjan's topological sort upon each addition of a new vertex to the graph, which was only required because of implementation details of the library I originally chose to model my data.

Faster cycle detection in a Directed Acyclic Graph?

What's the actual problem?

Exhibit 1: Hangs forever with 100% CPU usage on the main thread after "Third time" is printed

Exhibit 2: Goes all the way through and prints "Done", and the result has no cycle

1 Answers1