I have a program in Ruby 1.9.3 that constructs a RubyTree. My data is best described as being a Directed Acyclic Graph (DAG); please note that it is not a polytree. Well, at least the data should be a DAG, despite users' best efforts to foil my program with bad data.
I build the DAG dynamically by parsing an XML document. The XML document does not explicitly specify the tree structure, but does provide cross-references of integer IDs that establish links between elements in the document.
I need to ensure that the RubyTree does not contain any cycles. The source data may (erroneously) have a cycle, and if it does, my program needs to be aware of it, rather than entering an infinite loop or crashing. To accomplish this currently, I mixed in the Ruby standard library's TSort module into RubyTree's Tree::TreeNode
class. This uses Tarjan's algorithm to perform a topological sort on the graph each time a node is added. During the topological sort, if a cycle is detected, an exception is raised -- exactly what I want.
For example:
module Tree
class TreeNode
include TSort
def tsort_each_node(&block)
self.each(&block)
end
def tsort_each_child(node, &block)
node.get_children().each { |child| yield child }
end
def add(child, at_index = -1)
#The standard RubyTree implementation of add goes here
begin
self.tsort()
rescue TSort::Cyclic => exce
self.remove!(child)
raise exce
end
return child
end
end
end
I had to modify a few other methods too. Basically anything that needs to traverse the tree or the children needed to have TSort implemented, OR get rid of its dependence on traversing (for instance, I simplified Tree::TreeNode#to_s()
to return Tree::TreeNode#name
.)
Right now, my program is functionally correct. I've done significant testing and the results are working fine: all I have to do is rescue TSort::Cyclic
at the right points in my code, and if I ever try to add a node that causes a cycle, the node gets removed and I can log the problem in a report to deal with later (by fixing the source data).
Problem is, on a RubyTree of size 75000 or so, where the number of edges is very close to being equal to the number of vertices minus 1, iteratively running Tarjan's algorithm produces an algorithmic complexity that looks pretty quadratic. Tarjan's itself is O(|V| + |E|)
, which in my case is about O(2*|V|)
, but each time I call add()
, |V|
increases by 1, as I build up the graph node by node. I can't simply call Tarjan's at the end, because I might need to traverse the graph, or parts of it, during the read-compare-add loop, and any traversal attempt may hang the program or crash it if there is in fact a cycle. (It goes without saying that my code is single-threaded; if it weren't, we'd have a huge problem. As it stands, I rely upon the fact that add()
never returns without raising an exception if there's a cycle, and even if there is a cycle, a node gets removed in such a way as to clear out the cycle before add()
returns.)
But it's too slow! It's taking more than half an hour just for this one loop, and my program consists of several other steps that take their own fair share of time. But as it stands, just performing Tarjan's is eating up the lion's share of the performance, judging from the results of ruby-perf
. I tried switching from arrays to linked lists in the RubyTree each
implementation, but it only cut down the runtime by about 1% by removing loads of Array#concat
calls.
I found this awesome paper by the Tarjan who invented the strongly connected components algorithm that Ruby's TSort
relies upon, and it seems that incremental cycle detection is an active area of research. However, the level of dialogue in the paper is significantly over my head, and I'm pretty sure I lack the mathematical background to translate the paper's findings into Ruby code. Not only that, but by reading the Remarks section of the paper, it seems that their best effort algorithms have rather worrisome worst-case runtimes of their own, so it may not even be faster than my current method, depending on the specifics of my data.
Am I missing something silly here, or is my best bet to put in the mental legwork to analyze Tarjan's paper and try to come up with a Ruby implementation of one of the algorithms? Note that I don't particularly care about the topological sorting aspect of the algorithm; it's a side effect of what I really want. If the tree were not topologically sorted but still had a guarantee of no cycles, I would be perfectly happy.
Also worth noting is that cycles are somewhat rare in my source data. That is, cycles can happen due to manual error in a data entry process, but they would never happen intentionally, and should always be reported to the program so it can tell me, so I can beat someone over the head with a billyclub for entering the wrong data. Additionally, the program absolutely must keep on chugging even if it detects a particularly nasty cycle, so I can't just stick my head in the sand and hope there won't be any cycles.
What's the actual problem?
As requested by some, here's a demonstration you can run to see the issue at work.
Install the stable version of RubyTree (I use MRI 1.9.3). Then compare the output of these two programs:
Exhibit 1: Hangs forever with 100% CPU usage on the main thread after "Third time" is printed
require 'tree'
a = Tree::TreeNode.new('a', nil)
b = Tree::TreeNode.new('b', nil)
c = Tree::TreeNode.new('c', nil)
a.add(b)
a.add(c)
puts "First time"
b.add(c)
puts "Second time"
b.add(a)
puts "Third time"
c.add(b)
puts "Fourth time"
c.add(a)
puts "Fifth time"
puts "Done"
Exhibit 2: Goes all the way through and prints "Done", and the result has no cycle
Note that I would normally be doing stuff within the rescue
blocks to log that cycle(s) happened, and complaining loudly at the human offenders who created these cycles.
require 'tree'
require 'tsort'
module Tree
class TreeNode
include TSort
def tsort_each_node(&block)
self.each(&block)
end
def tsort_each_child(node, &block)
node.get_children().each { |child| yield child}
end
def to_s
name
end
def get_children()
return @children
end
def add(child, at_index = -1)
unless child
raise ArgumentError, "Attempting to add a nil node" # Only handles the immediate child scenario
end
if self.equal?(child)
raise TSort::Cyclic, "Cycle detected: [#{child.name}, #{child.name}]"
end
# Lazy man's unique test, won't test if children of child are unique in this tree too.
if @children_hash.include?(child.name)
raise "Child #{child.name} already added!"
end
if insertion_range.include?(at_index)
@children.insert(at_index, child)
else
raise "Attempting to insert a child at a non-existent location (#{at_index}) when only positions from #{insertion_range.min} to #{insertion_range.max} exist."
end
@children_hash[child.name] = child
child.parent = self
#CYCLE DETECTION - raises TSort::Cyclic if this caused a cycle
begin
self.tsort()
rescue TSort::Cyclic => exce
self.remove!(child)
raise exce
end
return child
end
end
end
a = Tree::TreeNode.new('a', nil)
b = Tree::TreeNode.new('b', nil)
c = Tree::TreeNode.new('c', nil)
a.add(b)
a.add(c)
puts "First time"
b.add(c)
puts "Second time"
begin
b.add(a)
rescue
end
puts "Third time"
begin
c.add(b)
rescue
end
puts "Fourth time"
begin
c.add(a)
rescue
end
puts "Fifth time"
puts "Done"
The goal, for me, is to develop code that is functionally equivalent to Exhibit 2, but scales better to larger numbers of vertices (I don't anticipate having more than 10^6 vertices, and in that case I would be fine with it taking several minutes ("go get a cup of coffee") on a modern desktop workstation, but not hours or longer.)